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I . INTRODUCTION 



The Fast Fourier Transform (FFT) has become popular 
because it computes the Discrete Fourier Transform (DFT) 
much faster than direct evaluation of the DFT definition. 
While evaluation of the DFT from the definition requires 
N^ complex multiplications, the FFT algorithm produces 
the same result with the number of multiplications 
proportional to N log 2 N. The DFT can be expressed as a 
vector-matrix equation: 

T1 W 

X(k) = W" x(n), n,k=0, 1 , . . . ,N-1 (1.1) 

where W = exp [ -j 27T/N ] . (1.2) 

The Nxl column vectors x(n) and X(k) are the input sequence 

and its DFT, respectively. The record length, N, is assumed 

n k 

to be a power of 2. The NxN matrix W represents the DFT 
operation on the input data sequence that yields the 
transformed output sequence. This thesis uses matrix 
factorization of the DFT matrix to describe a parallel 
multi-processor implementation of the FFT. After a detailed 
investigation into the properties of the necessary inter- 
processor data transfers, the Fast Hartley Transform (FHT) 
and the Walsh-Hadamar d Transform (WHT) are also investigated 
for compatibility with the parallel-processing network. 



7 



A. BACKGROUND 



The Cooley-Tukey FFT algorithm can be conveniently 
viewed as a factorization of the DFT matrix which it 
implements . Many authors have used this approach to 
illustrate the computational savings afforded by the 
FFT [Refs. 1-5]. Each factor of the f actored-matrix 
representation corresponds to one butterfly stage of the 
FFT. The product of matrix factors is equivalent to 
evaluating the DFT by its definition, though the rows or 
columns of the FFT matrix product are reordered in 
comparison to the DFT matrix. The row (column) reordering 
corresponds to permutation of the output (input) sequence 
ordering, respectively. The keys to the speed advantage 
of the FFT are that the matrix factors are sparse 
[Ref. 3: p. 328], and that the complex weighting factors 
are periodic [Ref. 4: p. 358]. These properties permit 
the calculation of the DFT coefficients to be carried out 
iteratively. 

A more recent approach to enhance signal processing 
of large data sets is parallel processing. This concept 
was used to develop the Distributed Signal Processor (DISP) 
at MIT Lincoln Laboratory [Ref. 6]. The DISP uses nodal 
processors interconnected by a Butterfly Network to perform 
real-time signal processing. The input data set is first 
divided equally among the processors. Then all processors 
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simultaneously perform the same operations on different 
segments of data. Finally, all processors simultaneously 
exchange partially-processed data through the Butterfly 
Network. The last two steps are repeated until the task is 
completed. Since many signal processing applications involve 
convolution and FFT algorithms, the inter-processor network 
was selected for efficient handling of these algorithms. 

The Butterfly Network allows all nodal processors to 
simultaneously transfer the necessary data among themselves 
without conflict [Ref. 6: p. 1]. References 6-10 discuss 
in detail the architecture of the DISP and the properties 
of the Butterfly Network. It is assumed that an FFT 
algorithm is carried out in parallel by P active processors 
to compute the DFT of a set of N data points. It is 
assumed that N and P are integer powers of 2, N>2P 
[Ref. 7; p.3]. Therefore, each processor contains 
M=N/P points, where M is also a power of 2. 

A new algorithm for the computation of power spectra 
is the Fast Hartley Transform (FHT) introduced by 
Bracewell. Requiring only real arithmetic computations, 
the FHT performs even faster than the FFT [Refs. 11-12]. 

The Walsh-Hadamard Transform (WHT) , on the other hand, 

uses Walsh functions (square waves) as its basis 

functions. Since the Walsh functions are square waves, 

they take on only two values, namely +1 or -1 [Ref. 13; p. 225]. 
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Both the FHT and WHT are studied for parallel implementation 
using the Butterfly Network. 

B. RESEARCH OBJECTIVES 

The primary objective of this research is to investigate 
the suitability of the Butterfly Network for mul ti -processor 
implementation of fast transform algorithms. This thesis 
develops a matrix representation of the multi-processor 
FFT (MPFFT) suitable for implementation on the Distributed 
Signal Processor. The MPFFT algorithm is based on a radix-2, 
in-place, decimation-in-time (DIT) algorithm with input in 
bit-reversed order [Ref. 6: p. 28]. It is felt that the 
matrix representation provides insight into the output 
reordering caused by the required transfers on the Butterfly 
Network. It will be shown that the transfer patterns can 
be represented by permutations of an NxN identity matrix. 
This makes the "transfer matrices" easy to generate, and 
leads to a description of the output reordering as 
a function of the number of active processors. After the 
transfer patterns and properties of the transfer matrices 
are examined in detail, the FHT and WHT algorithms are 
investigated for potential support by the Butterfly Network. 

To permit concentration on the matrix manipulations, all 
normalizing factors of (1/N) are neglected. With the DFT 
definition used here, a factor of (1/N) is required in the 
definition of the inverse DFT. Other fast transforms also 
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require normalizing factors in either the forward or 
inverse transform definitions. This factor, in either 
direction, can easily be inserted, as necessary, as a 
gain multiplying each output element. In the MPFFT matrix 
representations of Section II , the column indices are 
repeatedly referred to as time indices. While this 
terminology is correct for the first stage, it is simply 
a convenience when referring to the partially-processed 
data at the second and successive computation stages. This 
terminology is used because it is felt that it helps rather 
than hinders the discussion. 

The multi-processor FFT is presented in Section II as 
a partitioning of the single-processor DIT FFT. The 
required inter-procesor data transfers are described by 
matrices, and their properties are examined in detail. An 
algorithm is presented for generating the transfer matrices 
as permutations of identity matrices. The output reordering 
caused by the transfers is then discussed, and an algorithm 
is developed for determining output ordering as a function 
of N and P. Alternative transfer patterns are investigated, 
and the pattern chosen in reference 6 is justified. The 
decimation-in-frequency (DIF) FFT is examined, and a 
general matrix equation for the MPFFT is developed. 

Section III investigates several other fast transforms 
for compatibility with the Butterfly Network. The 
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relatively new Fast Hartley Transform (FHT) is partitioned 
for parallel processing using the Distributed Signal 
Processor. Finally, the Walsh and Hadamard orderings of 
the Walsh-Hadamard Transform (WT) are studied. Section 
IV presents a unified approach to the algorithms for 
transfer matrix generation and corresponding output 
reordering, and also shows how the two separate algorithms 
from Section II can be combined into a single algorithm. 
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II. FFT MATRIX FACTORIZATIONS 



This section introduces shorthand notation to simplify 
the matrix manipulations to follow, and presents a repre- 
sentation of a distributed FFT algorithm using partitioned, 
factored matrices. The matrix version of the distributed 
FFT will be shown to involve pe'rra.u.tatipns of identity 
matrices to describe the necessary data transfers. These 
"transfer matrices" possess many interesting properties 
that will be discussed in detail. 

Many authors have used matrices to describe FFT 
algorithms [Ref. 1: pp. 148-149 and 2: pp. 63-69]. In fact, 
the initial derivation of the FFT depended on the fact that 
an NxN matrix, none of whose elements is zero, can be 
factored into sparse matrices with many zeros [Ref. 3: p. 297]. 
A Discrete Fourier Transform defined by 

N-1 

X(k) = Z x(n) exp [ -j ( 2 Tf/N) nk ] k = 0 , 1 , . . . , N-1 (2.1) 

n=0 

can be represented as a vector matrix equation 
E 

X(k) = W' x(n) , (2.2) 

where W = exp ( -j 2 ir/N) , and E is a matrix of exponents. The 
input sequence is the vector x(n), given by 
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(2.3) 



x(n) = [x(0) x(l) ... x(N-l)] 



T 



and the transformed output sequence X(k) is 



X(k) = [X(0) X(l) ... X(N-l) ]^. 



(2.4) 



The vectors x(n) and X(k) may represent sequences in either 
normal order, or in digit-reversed order. In several forms of 
the radix-2 FFT algorithms a normal-ordered input 
data sequence produces output in bit-reversed order. 

Similarly, a bit-reversed input data sequence yields a 
transformed output sequence in normal order. 



The matrix W~ is a two-dimensional array representing 
the transform operation. It has row indices k=0 , 1 , . . . , N- 1 , 
and column indices n=0 , 1 , . . . , N-1 . Since the "twiddle factor" 
W = exp(-j2iT/N) is common to all the elements of the DFT 



matrix W , a matrix of exponents E can be written, whose 
elements are functions of k and n. For example, for N=8 , 
(2.1) gives 



E 



E 



X(0) = [W° W° W° W° W° W° W°] x(n) 
X(l) = [W° W^] x(n) 
X(2) = [W° W^] x(n) 



(2.5) 



X(7) = [W° W^] x(n) 
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In matrix form, 



X ( k ) =W X ( n ) = 



and E = 



0 

1 

2 

3 

4 

5 

6 
7 



X(0) 




o 

o 

o 

o 

o 

o 

o 




~x(oT 


X(1) 




w° 




x(l) 


X(2) 




W® W“ 




X(2) 


X(3) 








x(3) 


X(4) 


= 


w° w'^ w'^ 




x(4) 


X(5) 








x(5) 


X(6) 




w‘^ 




x(6) 


X(7) 








x(7) 



1 

0 

1 

2 

3 

4 

5 

6 
7 



2 

0 

2 

4 

6 

0 

2 

4 

6 



3 4 

0 0 



3 
6 
1 

4 
7 
2 

5 



4 

0 

4 

0 

4 

0 

4 



5 

0 

5 
2 
7 
4 
1 

6 
3 



0 0 



6 

4 

2 

0 

6 

4 

2 



( 2 . 6 ) 



(2.7) 



Note that each element in the matrix of exponents, E, is the 
product of the row label, k, and the column label, n, computed 
modulo N (i.e,, k*n mod N = remainder of k*n/N). The compu- 

n Ic 

tation of the kn products mod N is due to the fact that W 
is periodic with period N, i.e.. 
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y ( n+mN ) ( k+rN ) _ ^nk 



m,l = 0,+l, ... 



(2.7a) 



ri Ic 

The periodicity of W is one of the keys to the FFT 
[Ref. 4: p. 358]. In general, for an N-point DFT, the E 
matrix of exponents is NxN, vi^ith entries: 



E = 



0 

1 

2 



N-2 

N-1 



0 1 2 

0 0 0 

0 1 2 

0 2 4 



0 N-2 N-4 
0 N-1 N-2 



N-2 N-1 

0 0 
N-2 N-1 

N-4 N-2 



4 

? 



( 2 . 8 ) 



A. SHORTHAND NOTATION 

One way to derive FFT algorithms is to factor the DFT 
E 

matrix W into a product of matrices. This method is also 
well-suited to the description of distributed FFT algorithms, 
and will be the method used here. 

Elliott and Rao have developed a shorthand notation that 
makes the matrix factorization easy to follow [Ref. 2: 
pp. 47-49]. To illustrate, consider the 4-point DFT defined 
by : 
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X(k) 



k=0, 1 ,2,3 



(2.9) 



E 

= W X ( n ) 

n=0 ,1,2,3 

In matrix form, 



X(0) 




"wO 


w° 








x(o7| 


X(1) 














x(l) 


X(2) 












» 


x(2) 


X(3) 






w3 








x(3) 


— — 




— 













( 2 . 10 ) 



This can be manipulated into a decomposition-in-time (DIT) 

E 

FFT by reordering the columns of W~ as follows: 



X(O)' 




W° 










~X(0) 




1 


1 


1 


1 




^(oT 


X(1) 














x(2) 




1 


-1 


-j 


j 




x(2) 


X(2) 












• 


x(l) 




1 


1 


-1 


-1 




x(l) 


X(32 










_1 




_x(32 




1 


-1 


j 


-j 


, 


^x(3) 



( 2 . 11 ) 

where W = exp(-j2r/A) = -j , and the n index values of the 

input sequence have been correspondingly reordered as well. 

E 

The matrix W~ can be factored to yield 



E 

W' 





( 2 . 12 ) 



17 



h ?2 

where W and W 



will represent the first and second 



stages, respectively, of butterfly computations in the 
DIT FFT. Let 



?2 



Then , 



W 



?2 



and 



W 



?1 



o 1 
o 

• I 




“o 0 . r 


0 . 1 




0 2.. 


0.2. 


and Ej = 


0 0 


0 . 3 




0 2 





0 




o 




1 


0 


1 


O [ 


0 




0 






0 


1 


0 


-j 


yO 


0 




0 




1 


0 


-1 


0 






0 






__0 


1 


0 





1 

o 


yO 


0 


0 




— 

1 


1 


0 


0 




y^ 


0 


0 




1 ' 


-1 


0 


0 


0 


0 


yO 


yO 


“ 


0 


0 


1 


1 


_0 


0 


yO 


y^ 




_o 


0 


1 


-1 



(2.13) 



(2.14) 



where the shorthand notation of a dot (no entry) in the 

matrices of exponents corresponds to an element value 
?L 

of zero in the W matrices. Performing the matrix 
multiplication of (2.12) yields 
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o 

o 


0 


' 




0 


0 








0 


o 

O 






0 

r 


7 


0 


0 






E ~ ~ 1 

= w “*W 


wO 


9 

0 W“ 


0 




0 0 




wO 










0 


O 

O 






lo 

o 




wO 














U^+O 


^^0+0 




> 






w° 




wO+0 










wO 




















w° 










jjO+0 










wO 









(2.15) 



Note that W~ is factored such that only one nonzero entry 

?2 

per row of W is multiplied by only one nonzero entry of 

any given column of W . This result is true for all 

factored FFT matrices [Ref. 2; p. 48]. Since W is a 

common factor, the exponents simply add (modulo N) when the 

a b a ^ b 

matrix multiplication is performed, i.e., W -W =W 
Therefore, an easier way to carry out the matrix multiplica- 
tion of (2.12) is by the addition of appropriate exponents 
of (2.13). Let the shorthand notation 



I = 




?1 



(2.16) 
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mean the matrix of exponents derived from the sum of 
appropriate elements of the factored matrices of exponents 
of (2.13), where the elements to be added are defined by the 
row-times-column rule of matrix multiplication. Thus, 



o' 

o 

• 1 




~o 0 . r 


0 . 1 




0 2.. 


0.2. 




0 0 


0 . 3 




0 2 



~0+0 


0+0 


0+0 


0+0 




~0 


0 


0 


0 


0+0 


0+2 


1 + 0 


1 + 2 




0 


2 


1 


3 


0+0 


0+0 


2 + 0 


2+0 




0 


0 


2 


2 


_0+0 


0+2 


3+0 


3+2 




_0 


2 


3 


1 



(2.17) 



and the result is the same as the exponents of (2.15). It is 
usually much simpler to work with the matrices of exponents, 
than to actually perform the matrix multiplication. Since 
all the necessary information is contained in the matrices 
of exponents, this section will deal mainly with this 
representation. 



B. FACTORED-MATRIX REPRESENTATION OF SINGLE-PROCESSOR FFT 
(SPFFT) 

E 

Factoring the DFT matrix W~ of (2.2) and reordering the 
rows and/or columns leads to various FFT algorithms. The 
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factored FFT matrices represent individual stages of the 
particular algorithms. The concept of matrix factorization 
will be developed in this section to describe a 
single-processor, radix-2, DIT FFT. 

To compute an N-point FFT (for N a power of 2), there 
are log 2 N "computation" matrix factors, each having 2 
nonzero elements per row. Each computation matrix represents 
the butterfly computations carried out at a given stage. 

The number of nonzero elements per row depends on the radix 
base of the algorithm. For example, a radix-3 FFT compu- 
tation matrix contains 3 nonzero elements per row, radix-4 
matrices have 4 nonzero elements per row, etc. A flow 
diagram for an 8-point radix-2 DIT FFT is shown- in Figure 2.1. 
The vector-matrix equation describing a single-processor 
implementation of the 8-point DIT FFT is: 

E E„ E. 

X(k) = W x(n) = W'-^ W W x(n) k = 0,l,...,8 

n=0 , 1 , . . . , 8 

(2.18) 

where 3 butterfly (computation) stages are necessary to 

?1 ?2 

accomplish the radix-2 FFT. The matrix factors W , W , 

?3 

and W represent the computations performed at stages 
1, 2, and 3, respectively. Let X^(n) represent the output 
of the first stage of butterflies of Figure 2.1. Then, 
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(2.19) 



X^(0) 


= 


X°(0) 


+ 


W°X°(I) 


= 


W°x°(0) 


+ 


W°X°(1) 


X^(l) 


= 


x°( 0 ) 


- 


W°X°(1) 


= 


W°x°( 0 ) 


+ 


W^X°( 1 ) 


X^(2) 


= 


X°(2) 


+ 


W°X°(3) 


= 


o 

X 

O 




W°X°(3) 


X^(3) 


= 


X°(2) 


- 


W°X°(3) 


= 


W°X°(2) 


-h 


4 0 

W X (3) 


Xl(4) 


= 


X°(4) 


+ 


W°X°(5) 


= 


W°X°(4) 


+ 


W°X°(5) 


X^(5) 


= 


X°(4) 


- 


W°X°(5) 


= 


W°X*^(4) 




0 

W X^(5) 


X^6) 


= 


X°(6) 


+ 


W°X°(7) 


= 


W°X°(6) 


+ 


W°X°(7) 


X^(7) 




X°(6) 


— 


W°X°(7) 




W°X°(6) 




4 0 

W X^(7) 



where = exp [ -j ( 2ir /r ) ( 4 ) =- 1 , W^=l, and X'^(n) = x(n) is the 
bit-reversed input sequence, i.e., the input to the first 
butterfly stage. In matrix form, 



.0, 



X^(n) = X°(n) = 






w° w° 



w 



w 



w 



w 



w 



0 



w 



0 



w' 



w 



X°(n) 



(2.20) 
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Similarly, the output of the second stage of butterfly 
computations is: 



X^(0) 


= 


x^(0) 


+ 


W°X^(2) 


= 


W°X^(0) 


+ 


W°X^2) 


x^(l) 


= 


x^l) 


+ 


2 1 

W X (3) 


= 


W°X^(1) 


+ 


2 1 

W X (3) 


X^(2) 


= 


x^(0) 


- 


W°X^(2) 


= 


W°X^(0) 


+ 


W^X^(2) 


X^(3) 


= 


x^(l) 


- 


W^X^(3) 


= 


W°X^(1) 


+ 


W^X^(3) 


X^(4) 


= 


X^(4) 


+ 


W°X^(6) 


= 


W°x^(4) 


+ 


W°X^( 6) 


X^(5) 


= 


X^(5) 


+ 


W^X^(7) 


= 


W°X^(5) 


+ 


W^X^(7) 


X^(6) 


= 


X^(4) 


- 


W°X^(6) 




W°X^(4) 


+ 


W^X^(6) 


X^(7) 


= 


X^(5) 


- 


W^X^(7) 


= 


W°X^(5) 


+ 


W^X^(7) 



since = -W^ , and = -W^. In matrix form, 



X^(n) = W'^X^(n) = 



W' 

0 



w 

0 

0 



0 

0 



1 



w 

0 

0 



w 

0 



w 

0 

0 



0 



X^(n) 



( 2 . 22 ) 
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The output of the last stage is: 



x^(0) 


= 


X^(0)+W°X^(4) 


= 


W°X^(0) 


+ 


W°X^(4) 


x^(l) 


= 


X^( 1)+W^X^(5) 


= 


W°X^(1) 


+ 


1 2 

W X (5) 


X^(2) 


= 


X^(2)+W^X^(6) 


= 


W°X^(2) 




2 2 

W^X^(6) 


X^(3) 


=: 


X“(3)+W^X~(7) 


= 


W^X-(3) 


+ 


3 2 

W^X (7) 


X^(4) 


= 


X^(0)-W°X^(4) 




W°X^(0) 


+ 


W^X^(4) 


X^(5) 


= 


X^(1)-W^X^(5) 


= 


W°X^(1) 


+ 


W^X^(5) 


X^(6) 


= 


X^(2)-W^X^(6) 


= 


W°X^(2) 


+ 


W^X^(6) 


X^(7) 


= 


X^(3)-W^X^(7) 


= 


W°X^(3) 


+ 


W^X^(7) 



since = -W° , = -W^ , = -W^ , and = -W^ 



Again, in matrix form, 



X(k)=X^( N)=W'^X^( N)= 



0 0 
0 



W 



0 0 

0 0 

T,0 



0 



W 



0 



0 0 
0 0 



0 

W 

0 

0 



0 



0 W 

0 0 

0 0 

,,o 



0 0 

„0 



0 



w 



0 



0 



0 

0 



0 0 
1 



w 



0 0 
0 0 



0 0 
5 



0 



0 0 

..2 



0 

W' 

0 



0 0 

,,6 



0 



w 



Combining (2.20), (2.22), and (2.24) gives the overall 



representation for the 8-point DIT FFT : 



(2.23) 



X^(n) 



(2.24) 

matrix 
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(2.25) 



~3 2 ~3 ~2 1 ~3 ^2 ~1 0 

X(k) = W X^(n) = W -^W ^X^(n) = W -^W X (n) 



where X(k) is the normal-ordered output sequence, and X^(n) 
is the bit-reversed input sequence. The factored matrices 
of exponents corresponding to the stages of the DIT FFT of 
Figure 2.1 are: 




n 



0 

1 



0 0 
0 

0 



2 

0 



0 



0 



2 

3 



0 



4 



0 



4 



5 



6 



7 




6 

0 . 0 . 

0 . 2 

0.4. 

0 . 6 



(2.26) 
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^3 = 



0 

1 

2 

3 

4 

5 

6 



0 



0 



0 



0 0 



0 



1 

0 



1 1 



7 



Q ... 1 



(2.26) 



where once again a dot or no entry in the exponent matrices 

~L 

corresponds to an entry of zero in the W matrices. 

The column indices are discussed in a subsequent paragraph. 

Performing the matrix multiplication using the shorthand 

notation described earlier gives the result shown in 

Figure 2.2. Notice that the addition of exponents in (2.27) 

is carried out mod N (N=8). Also observe the sparseness of 

the matrix factors , E 2 , and E^. Each matrix factor 

contains only 2N entries, and only the final overall matrix 

2 

of exponents, E, contains numerical entries in all NxN=N 
elements . 

A few comments about the nature of the DIT FFT algorithm 
are in order. When an N-point DFT is divided into two DFT's 
of one-half the original size, the time sequence separation 
of the input of either of the smaller DFTs is increased by 2. 
This corresponds to dropping (decimating) alternating inputs 
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Figure 2.2 8-Point/l-Processor DIT FFT 
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of the original DFT; hence the name decimation in time. The 
idea behind the FFT is thus to break the original N-point 
sequence into two N/2-point sequences, the DFT’s of which 
can be combined to give the DFT of the original N~point 
sequence. Next, each N/2~point sequence is decomposed 
into two N/4-point sequences, and so on, until 2-point 
sequences result. The DFT of a one-point sequence is simply 
the sample itself [Ref. 5: p. 49]. The input sequence 
indices for an N-point DFT are aeparated by 1 unit, based 
on a normalized analysis period, (record length), P = 1 
second. This yields transformed output sequence frequencies 
separated by 1 Hz [Ref. 2: p. 60]. Thus, the column labels, 
n, of the factored matrices of (2.26) are separated by larger 
increments of time going from right-to-left in Figure 2.1. 
That is, the last stage, , has column labels separated by 
1 unit, E 2 has indices separated by N/4 = 2units, and the E^^ 
column labels are separated by N/2=4 units. Performing the 
matrix multiplications of (2.27) results in a time sequence 
mixing operation, in which the time labels in the matrices 
of exponents add. The matrices E^ and E 2 have sample periods 
0 and 1 or 0 and 2 units, respectively (based on an analysis 
period of P = 1 second). These periods mix so that E^ E 2 
has periods 0, 2, 1, and 3 seconds. Likewise, the time 
sequence periods 0 and 4 seconds in E^ combine to yield 
periods 0, 4, 2, 6, 1, 5, 3, and 7 seconds in the overall 
matrix, E. The overall DIT FFT matrix of exponents, E in 
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(2.27) is the same as the DFT matrix of exponents (2.7), 
except for a reordering of the columns. Therefore, the DIT 
FFT matrix can be viewed as a DFT matrix with reordered 
columns . 

C. FACTORED-MATRIX REPRESENTATION OF MULTI-PROCESSOR 
FFT (MPFFT) 

In this section the concept of matrix factorization will 
be extended to the description of distributed FFT algorithms, 
with consideration of the multi-processor FFT implemented on 
the Butterfly Network at M.I.T. Lincoln Laboratory [Ref. 6]. 

Distributed FFT algorithms can also be described in terras 
of row or column permutations of an overall DFT matrix, 
with the inclusion of matrices describing the transfer(s) 
of data among the several processors. Figure 2.1 can be 
partitioned to describe an 8-point FFT implemented using 4 
processors. The result of this partitioning is shown in 
Figure 2.3. 

Since the DIT FFT algorithm combines data separated by 
1 , 2 , 4 , . . . , N / 2 memory locations progressing through the 
butterfly stages of the algorithm, the required separation 
will eventually exceed the size (M=N/P) of the data block 
held within each processor. (N=nuraber of input data points, 
P-nuraber of active processors; both assumed to be a power of 
2.) At this point, and for each subsequent computation stage, 
an inter-processor data transfer must take place. Kirk and 
Verly introduced the terminology ’’pre-transfer” and 
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"post-transfer" butterflies to describe these multi-processor 
operations [Ref. 7: pp. 5-6]. The butterfly computations 
that are accomplished prior to any data transfers are called 
pre-transfer butterflies. Those which are carried out after 
data transfers are called post-transfer butterflies. 

Partitioning the N data points equally among P processors 
leads to L'"' = log^ (M=N/P) stages of pre-transfer butterflies, 
and log 2 P post-transfer butterfly stages. Since each post- 
transfer butterfly stage requires one data transfer, there 
are also log.^P transfer stages. Note that the number of 
transfers depends only on the number of processors, and not 
on the input data record length. 

The pattern of transfers used in reference 7 consists 
of an exchange of the lower half of the data in one processor 
with thre upper half of the data in another. (In this context, 
lower and upper refer to local memory addresses, and not to 
relative memory positions.) The destination processor in a 
given transfer is determined by the .cube^ function, vi^hich 
complements the i_^ bit of the source processor's binary 
address [Ref. 8: p. 224]. For example, consider processor 
Pg in a four processor (P=4) network. The processor's binary 
address is 00. The cubeg function is 01, and the cube^^ 
function is 10. The destination processors in the first 
data transfer are determined using the cubeg function, those 
in the second transfer by the cube^^ function. In general, 
the destination processors in the itji transfer are determined 
by the cube^^_j^^ function [Ref. 9: p. 74]. 
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To determine whether a processor transfers the upper or 
lower half of its data, processors are designated as 
"upper-half" or "lower-half" processors at each transfer 
stage. At the it_h transfer stage, the (i-l)^_t bit of the 
processor's binary address is inspected. If this bit is 
zero, the processor is designated as an upper-half processor, 
and it transfers the upper half of its data at this stage. 

A processor whose (i-l)£_t bit is one transfers the lower 
half of its data at the i_^ transfer stage [Ref. 9: p. 74]. 

For example, in Figure 2.3, at the first transfer stage (i=l), 
the Q th bits of processors Pq and ?2 are zero. Therefore, 

Pq and P 2 are upper-half processors; similarly, P^ and P^ 
are lower-half processors at this stage. 

The transfer pattern chosen in references 7 and 9 
is not unique, but leads to a regular structure that can be 
described analytically. When these data transfers are 
represented using matrices, the "transfer matrices" possess 
many interesting properties. As will be shown later, the 
data transfers can be described as permutations of an NxN 
identity matrix. Although single-processor algorithms 
generate output sequences in normal order when the input 
data is in bit-reversed order, the distributed-processor 
algorithm produces data in what reference 7 refers to as 
"pseudo-normal" order. 
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The distributed implementation of an 8-point DIT FFT with 
bit-reversed input data shown in Figure 2.3 can be described 
by the vector-matrix equation: 



?3 T2 ?2 ~1 

X(k)=WWWWWx(n) 



(2.28) 



8 

8 



k=0 , 1 , . 
n=0 , 1 , . 

where , E 2 and E^ are matrices of exponents describing 
the butterfly computations at the first, second, and third 
stages, respectively. Matrices of this type are referred 
to as "computation matrices". and ?2 are matrices 

describing the first and second data transfer stages, and 
are called "transfer matrices". To illustrate the generation 
of the transfer matrices, consider the first data transfer 
in Figure 2.3: 
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0 


0 


0 


0 


0 


0 


1 


0 




x^d) 


X3'(0) 




x^(l) 




0 


0 


0 


0 


0 


1 


0 


0 
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where the superscripts indicate stage number (prime denoting 
a transfer stage), the subscripts denote the processor number, 
and the argument specifies the multi-processor memory 
location. In shorthand notation. 



0 



0 

. 0 . . . 

0 



0 . . . 

. 0 . 
. 0 . 

0 



(2.30) 



since W = 1. 

Using the shorthand notation to carry out the matrix 
multiplication of (2.28) gives the result shown in Figure 2.4. 
Note the ordering of the output sequence X(k), indicated by 
the row label, k . One-half of the data in a given processor 
can be read out in natural order before it is necessary to 
move on to the next processor to read the next element of 
the output sequence. Two passes through the processor 
network are thus necessary to read out the entire output 
sequence. This is always true, regardless of the length 
of the output sequence or number of processors employed. 
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Figure 2.4 8-Point /4-Processor DIT FFT(BR Input) 
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The partitioned computation matrices are identical to 
the single-processor case for the stages prior to any data 
transfers (the pre-transfer butterflies). For the post- 
transfer butterfly stages, the rows of the computation 
matrices have been permuted vis-a-vis their single-processor 
counterparts. (For example, compare the row ordering of 
E 2 of Figure 2.4 with E 2 of Figure 2.2.) Notice that the 
exponents within any given row are identical, though the 
ordering of the rows is changed. This reordering is caused 
by the data transfers, and will be examined in detail in the 
next section. The entire purpose for data transfers is to 
get data points that need to be combined in subsequent 
stage butterflies into the same processor. This is necessary 
when the memory separation between points to be combined in 
a butterfly exceeds that of the data stored in a given 
processor. The most economical way to affect the necessary 
transfers is to transfer only one-half the data at any given 
transfer stage. There are several other constraints on 
potential transfer patterns; these will also be discussed 
in the next section. Finally, observe once again that the 
exponent entries in any computation matrix, E^^ , can be 
determined by the residue of kn modulo N, where k is the row 
label, and n is the column label. 
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D. PROPERTIES OF TRANSFER MATRICES 

At the heart of the multi-processor FFT (MPFFT) 
algorithm are the inter-processor data transfers that must 
occur. There are log 2 P inter-processor transfers required 
to compute an FFT implemented using P active processors. 

When the transfers are represented by matrices, the "transfer 
matrices" possess many interesting properties. These 
properties will be examined in detail in this section. The 
required data transfers permute the ordering of the output 
sequence, hence the rows of the FFT matrix. When the FFT 
is implemented on a single processor, the output is produced 
in normal order when the input data is in bit-reversed (BR) 
order. The bit-reversal is a consequence of the decimation- 
in-time FFT algorithm for computing DFTs. The multi- 
processor FFT algorithm produces "pseudo-normal" (PN) 
ordered output when the input sequence is in bit-reversed 
order. Thus, PN ordering is a characteristic of MPFFT 
implementations. This section will show how and why the PN 
ordering comes about. 

1 . Transfer Matrix Patterns 

An eight-port butterfly network to interconnect 
eight processors is shown in Figure 2.5. Reference 7 
describes in detail the topology and properties of the 
Buttterfly Network (BN). The transfers that take place on 
the BN are included in the vector-matrix FFT representation 
by additional matrix factors to be called "transfer matrices" . 
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Figure 2.5 Eight-Port Butterfly Network 



The transfer matrices represent the input-output connectivity 
of the BN, where each switch is regarded as a two-port 
network. The inputs to the BN come from output ports of the 
processors Pq, P^, 5 the network's outputs in turn 

are connected to the input ports of these same processors 
[Ref. 10: p. 1]. It is convenient to view the transfer 
matrices as permutations of an NxN identity matrix. An 
identity transfer matrix results in effectively no transfer, 
leaving all the data in unchanged locations. When 
representing the actual inter-processor data transfers, 
elements move off the main diagonal in a very regular pattern. 
One-half of the elements remain on the main diagonal. (Recall 
that only half the data is transferred at any given transfer 
stage.) The remaining elements move right (or equivalently, 
down) off the main diagonal or left (up) off the main diagonal. 
The elements move so as to maintain symmetry about the main 
diagonal. This is a convenient property, the consequences 
of which will be discussed later. 

The only parameters that must be known to determine the 
exact form of a particular transfer stage matrix are the 
record length, N, the number of active processors, P, and the 
transfer stage number, R. The number of active processors 
sets the number of transfer stages, log 2 P. The ratio 
N/P=M (the size of the data block in each processor) determines 
the number of elements that remain on the main diagonal, M/2. 

To illustrate the. exact pattern of a transfer matrix, consider 
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a 16-point DIT FFT implemented using 4 active processors. 

The four-processor system requires log2P=2 transfer stages. 

The transfer matrices describing the inter -processor transfer 
are shown in Figure 2.6. The NxN matrices are partitioned 
into blocks of dimension MxM . The first MxM block corresponds 
to processor Pq, the next block to Pj^ , etc. Labels are 
added indicating processors corresponding to given rows and 
columns. The local processor memory addresses are listed 
next to the processor addresses. The MxM blocks corresponding 
to individual processors are highlighted. Numerical entries 
not falling into a highlighted processor block (along the 
diagonal) describe the inter-processor transfers. The 
destination processor is indicated by the element’s row 
position and the source processor is given by the column 
position. "Upper-half" or "lower-half" processors are 
determined by which column(s) contain the entries that have 
moved outside the processor's block. If the right-most 
columns (larger local memory addresses) corresponding to a 
source processor contain the transferring entries, the 
processor is an "upper-half" processor at that stage. 
Similarly, left-most column transfer entires correspond to 
"lower-half" processors. Recall the determination of 
"upper-half" and "lower-half" processors discussed in the 
previous section. Processors with the same (i-l)s_t bit of 
their binary addresses exhibit the same pattern at the i th 
transfer stage. In addition, processors that exchange data 
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at the i_yi transfer stage differ only in the (i-l)s_t bit of 
their binary addresses. To illustrate these ideas, consider 
the transfer matrix of Figure 2.6. The entries in rows 4 
and 5 occur in columns 2 and 3, respectively, outside the 
4x4 block corresponding to Pq . Thus, during the first 
transfer stage, processor transfers the upper-half (memory 
locations 2 and 3) of its data to P^ (processor corresponding 
to rows where the entries appear). Recall that upper and 
lower refer to the local memory addresses, and not to relative 
position in memory. Note also that the 0_yi bit is the only 
bit vvrhere Pq's and binary addresses differ. (Bit value = 0 

corresponds to upper-half processors, and a 1 indicates 
lower-half processors.) 

The distance moved off the main diagonal for those 
elements shifting is given by the difference between single- 
processor memory separation of points combining in butterflies 
and the separation available in the local processors. Points 
participating in a butterfly inthe single-processor DIT FFT 
are separated by 1 , 2 , 4 , 8 , . . . , N / 2 memory locations at successive 
stages. In general, the separation between points participating 
in a butterfly at stage L is 2^ ^ . The maximum separation 
available in the MPFFT is M/2. Therefore, elements move 
( 2(1 1)) _ (in/2) places off the main diagonal when representing 
the transfer prior to the L th butterfly stage. In the 16-point/ 
4-processor example, the first transfer occurs prior to the 
third butterfly stage. Thus, elements shifting off the main 
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(4/2))= 2 places right or left/down 



diagonal move (2^^ 
or up. In summary, all the information about an 
inter-processor transfer stage can be determined by 
inspecting the corresponding transfer matrix. 

2 . Algorithm for Generating Transfer Matrices 

The regular patterns exhibited by the transfer 
matrices make them easy to generate. The row/column positions 
can be expressed as an ordered-pair (i,j). For an identity 
matrix (no transfer), all elements reside on the main 
diagonal and i=j . When actual inter-processor transfers are 
represented, the column index is permuted so that i^^j . The 
algorithm for obtaining the column indices as permutations 
of the row indices is as follows. First, the rows and 
columns are numbered from 0 to N-1 . Next , the row indices 
are represented as binary numbers. The column indices of 
elements describing inter -processor transfers are obtained 
by switching 2 appropriate bits in the binary representation 
of the row indices. When computing . an N-point MPFFT, the 
matrices are of dimension NxN. Thus, there are log 2 N bits 
necessary to represent the row/column indices in binary. 

Label these bits bj^ , bj^_ , . . . , b ^ , bg , where L = (log 2 N)-l. The 
bit whose label is (L*-l) is switched with the bit whose 
label depends on the transfer stage number. (L'''*=log 2 M=number 
of pre-transfer butterfly stages.) At the first transfer 
stage, for example, bit number (L'"'-l) is switched with the 
bit indexed one number greater. At the second transfer 
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stage, bit (L'^-1) is switched with the bit whose label is two 
greater, etc. This procedure generates the column position, 
j, in the ordered-pair (i,j) from the row position. For 
example, an 8-point DIT MPFFT implemented on 4 processors 
requires 3 bits to represent the row/column indices 
0,1, ...,7. These bits are labelled b 2 ,b^,bQ. Then 
L*=log 2 M=log 2 (N/P)=l . To determine the column positions 
of the transfer matrix entries, the row indices are permuted 
as described above. That is, for the first transfer stage, 
bits bgCL'^-l) and b are switched. At the second transfer 
stage, bits b^ and b 2 are switched. Thus, the transfer 
matrices for this example are obtained a shown in Figure 2.7. 
Note that switching the appropriate two bits at a given 
stage has no effect on the elements that remain on the main 
diagonal. For elements not transferring, the two appropriate 
bits are the same for the row index at the stage of interest. 
Consider now the powers of 2 that the bits represent. The 
distance shifted off the main diagonal is the difference 
represented by the bit-switching operation. For example, 
when switching b^ and b^ , the difference is 2^-2^=l. At 
this transfer stage, elements shifting move one place off 
the main diagonal. This is simply the difference between i 
and j when i^j . If this algorithm results in a larger 
column index ((i,j) j i), the element in the transfer 
matrix shifts right off the main diagonal. If the row 
index permutation results in i>j , the corresponding element 
shifts left. Right shifts off the main diagonal correspond 
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Column 



to "upper-half" processors, and left shifts correspond to 
"lower-half" processors. The bit-switching algorithm holds 
for any number of points and any number of processors. It 
always involves switching the bit labelled (L'^'-l) with a 
bit determined by the transfer stage number. Although this 
algorithm works for either DIT or DIF MPFFTs with bit-reversed 
input data sequences, it must be slightly modified to 
accommodate pseudo-normal input. This modification will be 
developed in a later section. 

3 , Orthogonality of Transfer Matrices 

All transfer matrices are symmetric. At each 
transfer stage, each processor retains half its data. The 
corresponding elements of the transfer matrix remain on the 
main diagonal. Each processor that transfers the upper-half 
of its data (transfer matrix elements shifting right) receives 
data from another -processor's lower half (elements shifting 
left). At each stage, elements shifting off the main diagonal 
all move the same number of places. This results in a symmetric 
permutation of the NxN identity matrix. 

Recall that entries in the transfer matrices only 
take on numerical values of 1(=W^) or 0(.=no entry). This 
property, coupled with the symmetry property, results in 
all transfer matrices being orthogonal. That is, 

T T 

(W“^)^ = R=1 ,2, . . . ,1082? (2.32) 
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Equivalently 



T p T p rp 

w" «(W~ ) = I (2.33) 



Or, in the shorthand notation, 
T p T = diag [0] 



(2.34) 



For example, consider again of the 8-point . 4-processor 
DIT MPFFT . Substituting into (2.33) gives): 
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Using shorthand notation, (2.34) gives: 



fo 




p 




p 71 
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... 0 ... . 




... 0 ... . 
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... 0 ... . 






. . . . 0 . . . 
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0 . 
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(2.35) 



(2.36) 
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These convenient properties will be put to use later. For 
now, it is sufficient to note that all transfer matrices, 
are orthogonal. 

4 . Pseudo -Normal Ordering 

It was previously shown that single-processor FFT 
algorithms can be viewed as row or column permutations of a 
DFT matrix. This section will demonstrate that multi- 
processor FFT implementations can be viewed as further 
row/column permutations of the same basic DFT matrix. The 
additional row/column permutations arise from the inter- 
processor data transfers required in the MPFFT algorithm. 

The inter-processor transfers also cause a reordering of the 
data sequences involved in the MPFFT. 

Inter-processor data transfers arise from the need 
to move partially-processed data points that will be 
combined in succeeding butterfly stages in the same processor. 
When computing a DIT MPFFT, this permutes the rows of 
succeeding computation matrix factors. It will be shown 
later that the inter-processor transfers similarly reorder 
the columns of the decimation-in-frequency (DIF) computation 
matrices. The exponent weighting factors within a given row 
or column remain unchanged; only the row or column ordering 
is changed. As would be expected, the row/column 
permutations also exhibit a regular pattern, based on the 
pattern of the transfer matrices. 
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The rows and columns of the matrix factors are 
labelled with values of k and n, respectively. The column 
labels, n, correspond to the ordering of the input data 
sequence, x(n). For now, only bit-reversed (BR) input 
sequences will be considered. The row labels, k, represent 
the ordering of the output sequence, X(k). Starting with 
the normal-ordered output that is produced by the SPFFT 
algorithm using BR input, this section describes how the 
normal ordering is transformed into "pseudo-normal" (PN) 
ordering by the inter-procesosr transfers on the Butterfly 
Network . 



Consider a 16-point BIT MPFFT implemented on an 
8 -processor Butterfly Network (BN). Three transfer stages 
(log 28 ) are necessary in this computation. The row labels, 
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Figure 2.8 illustrates which data points are combined in the 

butterfly computations at each stage. During the first 

(and only) pre-transfer butterfly stage, points x(0) and 

x(l) are combined in a butterfly in processor Pq. The second 

( 9-1 ) 

butterfly stage requires separation between points of 2 =2 

memory locations. This separation is not available within a 
given processor, so an inter-processor transfer must occur. 
After the first transfer, par tially-processed data points 
x(0) and x(2) reside together in Pq , and another butterfly 
computation is performed. This procedure continues until 
the computation is completed. The PN output sequence X(k) 
that results is in the order shown after the final transfer 
stage, T^. This pseudo-normal ordering depends only on the 
record length, N, number of active processors, P, and their 
ratio, M=N/P. The record length determines the total 
number of butterfly stages, log 2 N. The number of transfer 
stages (log 2 P) is set by the number of processors. The size 
of the data block in each processor, M, sets the number of 
pre-transfer butterfly stages (log 2 M=L^'‘) that can be computed 
prior to any transfers. The PN ordering is a function of the 
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number of transfers, hence processors. This is analogous to 
the SPFFT case, where bit-reversal is dependent on the record 
length, N. Note also from Figure 2.8 that one-half the points 
in each processor are output in normal order, after the final 
transfer. Thus, the output sequence can be rearranged by 
additional network transfers, or the processor local memories 
can be read cyclically to obtain the output data in normal 
order [Ref. 7: p. 18]. 

The reordering of the output sequence, from normal 
to PN ordering, can be derived by a bit switching algorithm 
similar to that for generating transfer matrix patterns. 

The bits to be switched at each stage are different in this 
case, however. The binary representation of the row labels, 
k, again requires log 2 N bits to describe the N-point output 
sequence. The bits switched this time always involve two 
adjacent bits, beginning again with the bit labelled (L'^-1) 
at the first transfer stage. The labels of the bits to be 
switched are then each incremented by one at each successive 
transfer stage. Figure 2.9 illustrates the algorithm for 2 
different values of P and M. Note in Figure 2.9 that only M/2 
values of k in each processor are changed at any given stage. 
Also observe that the starting bit to be switched at the 
first transfer stage is different for different values of M. 
This is because fewer active processors compute more pre- 
transfer butterfly stages, and require fewer transfers (less 
reordering). Consider again the powers of 2 represented by 
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Figure 2.9a Normal t.o Pseudo-Normal Ordering 

16-points/8-processors DIT MPFFT (L^-1)=0 
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Figure 2.9b Normal to Pseudo-Normal Ordering 

16-points/4-processors DIT MPFFT (L><«=-1)=1 



* Figure 2.9 Permutation (by stage) Normal to 
Pseudo-Normal Ordering 
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the bits. Since the first transfer in Figure 2.9a switches 
bits b^ and b^^ , the k values change by 2^-2*^=l. This 
explains why the bits to be switched are adjacent in this 
case. During each transfer, the network is preparing for 
butterflies requiring separation between points twice that 
available. Thus> bits representing successively higher 
powers of 2 are switched at succeeding stages. 

5 . Selection of Transfer Matrices 

The transfer patterns used in references 7 and 9 
are not unique. Other patterns of transfers were investigated, 
but had the disadvantage of structures for which the general 
pattern could not be easily discerned and described analytically 
[Ref. 9: p. 74]. As would be expected, the same situation 
exists regarding the transfer matrix factors describing dis- 
tributed FFTs on the BN. This section discusses the 
selection of appropriate transfer matrices to represent the 
inter-processor transfers on the BN. 

Several constraints on the tnake-up of potential 
transfer matrices can be identified. These constraints are: 

1) Entries take on values of 1 or 0 only. 

2) Only one entry per row or column is permitted. 

3) The toplogy of the Butterfly Network must be satisfied. 

4) Data points to be combined in the next butterfly stage 
must reside in the same processor after transer. 

5) One-half the entries remain on the main diagonal. 



54 



These constraints greatly reduce the number of NxN matrices 
that qualify as transfer matrices. The constraints also 
lead to the transfer matrices being symmetric, hence 
orthogonal. This convenient property will be exploited in 
the next section. 

Beginning with the NxN identity matrix, the elements 
are permuted according to the constraints listed above to 
determine "legal" transfer patterns. For the 8-point/ 
4-processor DIT MPFFT , this procedure yields only two 
matrices for each transfer stage that meet all the constraints. 
These matrices are shown in Figure 2.10. The transfer matrices 
actually used are shown in Figure 2.10a. The other matrices 
that satisfy all the constraints are shown in Figure 2.10b. 

When the transfer patterns of Figure 2.10b are used, two 
undesirable properties result. First, since the higher- 
numbered data points occupy the lower addresses in each 
processor memory, the butterflies that follow must be 
inverted, with the twiddle factor paterns upside-down. 
Additionally, following the second transfer, the weighting 
factor patern is scrambled from the order expected. The 
second undesirable result involves the ordering of the 
output sequence. Instead of normal, bit-reversed, or 
pseudo-normal ordering, all of which can be unravelled, the 
output is in indecipherable order: X(5) X(l) X(4) X(0) X(7) 

X(3) X(6) X(2), In addition to meeting all the constraints. 
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Figure 2.10 8-Poin t/4~Processor Transfer Patterns 



the transfer patterns used in reference 7 also lead to 
a consistent pattern in the twiddle factors in every 
butterfly stage. The PN ordering of the output sequence 
is easy to describe and unravel. Note also that the 
alternate transfer matrices are more difficult to generate. 
That is, the bit-switching algorithm presented in Section 
II. D. 2 does not work for the alternate transfer matrices. 

E. GENERALIZED MULTI-PROCESSOR FFTs 

Single-processor FFT (SPFFT) algorithms permute the 
row or column ordering of the DFT matrix v^hich they imple- 
ment. This is turn reorders the output (permuted rows) or 
input (permuted columns) data sequence. The reordering 
caused by SFFT algorithms changes normal-ordered input or 
output sequences into bit-reversed (BR) order. The multi- 
processor implementation of the decimation-in-time (DIT) 

FFT algorithm described in Section II. D. 4 has been shown 
to further reorder the rows of the basic DFT matrix. This 
further permutes the ordering of the transformed output 
sequence. As was shown, when using bit-reversed input data, 
the DIT multi-processor FFT produces a pseudo-normal (PN) 
ordered output sequence. Pseudo-normal ordering is thus a 
consequence of this multi-processor FFT (MPFFT) algorithm, 
as bit— reversed ordering is associated with SPFFT 
algorithms. This section describes the generalized 
factorization of MPFFT matrices to obtain the stage-by-stage 
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matrix factors. The DIT MPFFT (BR input/PN output) examined 
previously will be extended to represent the DIT MPFFT 
with PN input ordering, and to decimation-in-frequency (DIF) 
MPFFT algorithms using both BR- and PN-ordered input 
sequences . 

The overall MPFFT matrix of twiddle factor exponents, 
E , is obtained by forming the modulo N product of the 
appropriately ordered row and column labels [Ref. 2: p. 43]. 
The row and column labels represent the ordering of the 
output and input sequences, respectively. A BR input 
sequence to the MPFFT implemented using the Butterfly 
Network produces PN-ordered output. Similarly, a PN input 
data sequence produces transformed output in BR order. This 
is true whether a DIT or DIF MPFFT is being computed, 
although the matrix factors representing individual computa- 
tion stages differ. Thus, after calculating the BR and PN 
sequence ordering for the given number of data points and 
active processors, it is easy to write down the overall 
MPFFT matrix . 

Once the overall MPFFT matrix is obtained, it must 
be factored to represent the individual computation stages. 

As discussed previously, an N-point FFT requires a total 
of log 2 N butterfly stages. When implemented on the multi- 
processor Butterfly Network, there are L*=log 2 M pre-transfer 
butterfly stages. (Recall that N=record length, P=number 
of active processors, and M=N/P=number of points in each 
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processor.) Therefore, each computation stage is represented 
by one matrix factor, with one transfer matrix prior to each 
post-transfer butterfly computation matrix factor. Each 
matrix factor successively pre-multiplies the factor(s) 
representing the previous stage(s) when the matrix product 
is carried out. Thus, the general form of the MPFFT is: 



where all logarithms in the matrix indices are base two. 

The remainder of this section discusses how to form each 
of the matrix factors in (2.37). 

1 . PIT MPFFT with Bit-Reversed Input 

The MPFFT matrix, E, is formed by reordering the 
row and column indices as described previously, and then 
forming the modulo N products to give the individual 
elements. This matrix must then be factored to obtain 
the representation of each computation and transfer stage. 
The most illustrative case to consider is when L’”"=log 2 M=l . 

In this case, there is only one pre-transfer butterfly 
stage, and the maximum number of transfers and post-transfer 
butterfly stages. Each processor contains M=2 data points, 
and is thus capable of performing a 2-point DFT internally. 
At successive stages, each processor can compute one-half 
of a 4-point DFT, one-fourth of an 8-point DFT, etc.. 




post-transfer butterflies and 
transfers 



pre- transfer 
butterflies 



(2.37) 
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provided that appropriate data points are transferred to 
reside within each individual processor. Other situations 
involving more than 2 points per processor are easily- 
described in relation to the = l case. 

Each NxN matrix factor in (2.37) is first partitioned 

into blocks of dimension MxM. Each MxM block along the main 

diagonal represents an individual processor. The computation 

stage matrices E^^ contain entries only in the submatrix 

blocks along the diagonal. The entries in these blocks 

describe the butterfly computations taking place vifithin the 

individual processors. The transfer matrices T contain 

- R 

entries outside the blocks along the diagonal to describe 
the inter-processor data transfers. 

Consider once again the 8-point/4-processor DIT 
MPFFT using BR input of Figure 2.4. Note that there are a 
total of 3 (log 28 ) computation matrix factors. represents 

the first (pre-transfer) butterfly stage, and there are 
2 (log24) matrix factors representing the post-transfer 
butterfly stages, each preceded by a transfer matrix. 

Recall that the DIT algorithm decomposes an 8-point DFT into 
two 4-point DFTs , then into four 2-point DFTs . See Figure 2.11. 
This decomposition partitions the input time sequence into 
subsequences having even and odd indices . At each 
successive decomposition, the butterfly inputs are separated 
by larger increments in time. The first butterfly stage 
combines samples from the time sequence separated by 4 
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Figure 2.11 Block Diagram of partifcion/recombination of 8-poinb 
Decimat/ion-in-time (DIT) FFT algorithm 



units (N/2). Thus, the first matrix factor , , has column 

(time) indices 0 and 4. The ordering of the row indices, k, 
at the first stage is the normal ordering characteristic 
of the single-processor DIT FFT with bit-reversed input data. 
The row ordering is then permuted by each inter-processor 
transfer stage until the output sequence ends up in the PN 
ordering characteristic of the MPFFT on the Butterfly 
Network. The second butterfly stage in the DIT FFT combines 
points separated by two memory locations. In this example, 
this separation is not available within a given processor, 
so an inter-processor transfer must take place. This 
reorders the rows of the next computaton stage, and all 
successive stages. The algorithm for reordering the row 
indices was presented in Section II. D. 4, The time 
separation between points combining in the second butterfly 
stage is 2 units, so the column labels for this matrix 
factor (E 2 ) are 0 and 2. Note that each twiddle factor 
exponent in the blocks along the diagonal is still obtained 
by the modulo N product of the row and column indices. The 
third (final) computation stage marix factor, , is derived 
similarly. The row ordering is further permuted by the 
second inter-processor transfer. The row indices of the 
final computation stage are now in the PN order in which 
the transformed output sequence appears. The column labels 
are 0 and 1 , corresponding to the butterfly inputs ' 
separation of 1 unit. It can be verified by comparing 
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Figure 2,4 with Figure 2,2 that identical butterfly computa- 
tions take place at each stage. That is, each row in the 
computation matrix factors contains identical entries at 
any stage, only the ordering of the rows is changed in the 
MPFFT by the inter-processor transfers. 

An explanation of the column labels of the transfer 
matrices is also necessary. A column label of 0 indicates 
that the element in that column remains on the main diagonal 
(doesn^t represent a transfer at that stage). The negatively- 
valued labels indicate elements moving down off the main 
diagonal, and correspond to lower-half processors at the 
transfer stage in question. Similarly, posi t i vely-valued 
labels indicate elements moving up from the main diagonal 
(upper-half processors). The magnitude of the transfer 
matrices' column labels in the DIT algorithm corresponds to 
the time separation between points that are to be combined 
in the butterflies at the next computation stage. The row 
indices are permuted by previous transfers the same as the 
two indices of the computation matrix factors. 

The matrix factors in (2,37) are called sparse 
matrices because of the numerous zero entries in any row 
or column. The sparse matrices cut down the number of 
arithmetic operations required to compute the DFT [Ref. 2: 
p. 67], Multiplying out the matrix factors gives the 
permuted DFT matrix E as the result. Taking the products 
can be regarded as a mixing operation in which the column 



63 



time indices of the matrix factors add. Successive 
decomposition of the input time sequence results in 
having only time values of 0 and 4 units, E 2 having only 
time values of 0 and 2, and E^ having only time values of 
0 and 1. When the matrix factors are multiplied, the time 
values mix so that E^ has time values of 0,4,2, and 6 
units. Similarly, in taking ( E 2 ‘”'T 2 ’"'E 2 mixing 

results in the product having time values of 0 , 4 , 2 , 6 , 1 , 5 , 3 , 7 . 
This represents the ordering of the input sequence which was 
successively decomposed at each stage in the factorization 
of E to give the DIT algorithm. Thus, carrying out the 
DIT MPFFT (multiplying the matrix factors) can be viewed 
as a recombination of the input sequence that was decomposed 
in setting up the algorithm. The row indices of the E 
matrix are determined by the row ordering at the final 
computation stage, and represent the ordering of the 
transformed output sequence, X(k). 

The 8-point /4-processor example (L‘" = l) can be 
extended in several ways as follows. Since M=N/P, there are 
really only two variables in (2.37), the record length, N, 
and the number of active processors, P. The total number 
of butterfly stages must equal the number of pre-transfer 
and post-transfer butterfly stages. Therefore: 

log2N = log2M+log2P = L* + log2P (2.38) 
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Two situations can result in fewer transfer stages than the 
L*=l example just cited. Extending the input record 
length to a higher power of 2 while keeping the number of 
processors constant results in one or more additional pre- 
transfer butterfly stages. This situation will also increase 
the total number of butterfly stages, log 2 N. Also, decreasing 
the number of active processors (again, required to be a 
power of 2 ) for a constant input sequence length results 
in more pre-transfer butterfly stages. This situation can 
result when processors suffer faults, and there are no spares, 
and leaves the total number of butterfly stages unchanged . 

Both of these situations increase the number of pre-transfer 
butterfly stages by increasing the size of the data block 
within each processor, M. This makes each processor capable 
of performing more than one pre-transfer butterfly stage. 

To illustrate, consider again the 8 -point DIT MPFFT, this 
time using only 2 processors, as shown in Figure 2.12. 

There is still a total of 3 (log 28 ) butterfly stages. Now, 
however, there are 2 (log 2 ( 8 / 2 )) pre-transfer butterfly 
stages, and only 1 transfer stage and 1 post-transfer 
butterfly stage. Referring to the algorithm of Section 
II. D. 4 to calculate the PN ordering for L*=2 , the E matrix 

can be written down. Each of the matrix factors E. is 

L 

divided into submatrix blocks of dimension MxM. Now, 
however, the blcoks will be 4x4. Each processor contains 
enough data to compute 2 pre-transfer butterfly stages. 
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Figure 2.12 8-Point /2-Processor DIT FFT (BR Input) 
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Since during the pre-transfer stage(s) each processor is 
functioning independently (i.e., as a single-processor), 
the pre-transfer matrices are identical to the SPFFT case. 
Note also that all processors contain the same weighting 
factor pattern prior to any transers, regardless at which 
stage the first transfer occurs. 

When the submatrix blocks representing the 
individual processors are of dimension 4x4 or greater, 
there are dots, representing no entry, \7ithin each block. 

This is analagous to the single-processor case, in that 
some or all of the required memory separation between 
points is available within the processors^ local memories. 
When the stage is reached where inter-processor transfers 
are necessary, rows of the succeeding stage matrices are 
permuted, as always. The column labels representing 
time values of each matrix factor are such that each 
processor block contains an equal number of zero and non-zero 
time labels. Regarding the transfer matrices, one or several 
of the transfer stages required in the case are 

^bypassed” as the N/P=M ratio increases. The algorithm 
for generating the transfer matrices presented in Section 
II,D.2 still applies, but the number represented by 
(first bit to be switched) increases. 
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2 . PIT MPFFT with PN Input 

The PIT FFT algorithm can be used on a single 
processor with normal-ordered input, to produce bit- 
reversed output. Similarly, the PIT multi-processor 
algorithm can also be used with a pseudo-normal input 
sequence. As would be expected, the transformed output is 
produced in bit-reversed order. The MPFFT matrix E is the 
transpose of the E matrix of the BR input case. This 
reflects the change in th input-output relationship of 
the PN- and BR-ordered sequences. The elements of the E 
matrix are thus still obtained from the modulo N product 
of the row and column indices representing the output and 
input sequence ordering. Unfortunately, the matrix factors 
describing individual stages are not simply the transposes of 
the matrix factors from the BR input algorithm. This is 
due to the different ordering of the input sequence. The 
inter-processor transfers occur in reverse order, and for 
the case where (fewer than maximum number of transfers), 

even occur at different stages. Specifically, for an 
8-point/2-processor PIT algorithm, L*=log2M=2. When using 
PN input data, there is one transfer stage (log 2 P) as 
expected. However, the one transfer is required after the 
first butterfly stage, and not after two pre-transfer 
stages, as in the BR input case. There are two post-transfer 
butterfly stages, but only one transfer. That is, there is 
no second transfer required between the two post-transfer 
butterfly stages. (Compare Figure 2.13 with Figure 2.12). 
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Figure 2.13 8-Point/2-Processor DIT FFT (PN Input) 
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Again, the easiest case to consider is when L*=l . 
There is only one pre-transfer butterfly stage, and each 
post-transfer stage is preceded by an inter-processor 
transfer stage. In this case, though, since the Butterfly 
Network is transforming PN input into BR-ordered output, 
the transfer pattern is the reverse of that in the BR 
input case. The individual transfer matrices are the same, 
since they are symmetric, but they occur in reverse order. 
(Compare T^^ of Figure 2.14 and T 2 of Figure 2.4.) The column 
labels once again correspond to the time decomposition of 
the input sequence. For an 8-point DIT, the labels are 
0 and 4 at the first stage, 0 and 2 at the second stage 
and 0 and 1 at the final stage. The column indices of the 
transfer matrices are also the same as before. The non-zero 
magnitudes still indicate time separation at the next 
butterfly stage. Values of zero indicate non-transferring 
elements, and the sign gives direction of movement off the 
main diagonal. The row indices for the first computation 
matrix factor are in what will be called "pseudo-bit- 
reversed" (PBR) order. This is the output ordering that 
would result in a SPFFT using pseudo-normal-ordered input. 
Recall that the exact PN ordering is a function of the 
number of required network transfers, log 2 P. The same 
is true of PBR ordering, since the transfers ultimately 
permute the PBR ordering into bit-reversed-ordered 
output. Thus, the elements within each matrix factor can 
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still be calculated from the modulo N product of the row 
and column indices. The row ordering is permuted by each 
successive transer, as before. At the final computation 
stage, the row indices, k, are in bit-reversed order 
corresponding to the ordering of the transformed output 
sequence. The column labels of the E matrix can again 
be found by summing the time indices of each of the 
individual matrix factors. 

The situation becomes slightly more complicated 
when L*^l . For PN input, there is only one pre-transfer 
butterfly stage before inter-processor transfers become 
necessary. There will be the same number of transfer 
stages as before (BR input case), but they occur in 
reverse order. This leaves log 2 M=L* computation factors 
at the end without any transfer matrices between them. 

This is analagous to the BR input case, when there were 
L* computation stages prior to any transfers. 

Recall the discussion on PN ordering from Section 
II. D. 4. Each processor contains two subsequences of M/2 
points in normal order, each subsequence separated by 
N/2. Since the DIT algorithm combines points separated 
by N/2 at the first butterfly stage, the PN input case 
calculates M/2 2-point DFTs in each processor at the first 
computation stage. There are (M/2)-l dots (no entry) in each 
block submatrix corresponding to the local memory 
separation between points combining in the butterflies. 
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Refer again to Figure 2.13. The butterfly width remains 
constant for successive stages preceded by inter-processor 
transfers, and decreases when computation stages are no 
longer preceded by transfer. Again, recall the BR input 
case where butterfly widths grew to a maximum, determined 
by local memory contents, then remained constant as 
transfer stages occurred. Since the PN input sequence 
does not have points separated by N/4 (second stage DIT 
butterfly width) residing in any processor, the first 
transfer stage occurs after the first butterfly stage. 
Transfers occur until log 2 P stages have been carried out. 
The final L'^‘ butterfly stages do not require transfers 
between them, as mentioned above. Each transfer 
successively permutes the pseudo-bit-reversed row labels, 
until the bit-reversed output sequence emerges. 

3 . Pseudo-Bit-Reversed Ordering 

Further discussion of the nature of pseudo-bit- 
reversed (PBR) ordering is necessary. Just as normal 
ordering is permuted to PN ordering by the DIT MPFFT using 
BR input, PBR ordering serves as the starting point for 
PN input algorithms. Also, PBR order is the ordering of 
the output that would result from a SPFFT DIT algorithm 
using PN input ordering. The terms "PN" and "PBR order" 
are really meaningless when talking about single-processor 
implementations, since each is a function of the number of 
inter-processor transfers. However, if a particular 
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configuration is assumed, say 8 points/4 processors, then 
that particular PN ordering, when input to a SPFFT, would 
result in the corresponding (8/4) PBR-ordered output. In 
any case, the algorithm of Section II. D. 4, which generates 
PN from normal ordering, can also be used to generate PER 
ordering from bit-reversed order. To use the algorithm 
as presented, the bits must first be reordered to bit- 
reversed ordering. That is, instead of descending order, 
the bits are reordered in ascending order. For example, 
a 16-point data sequence requires 4 bits to describe the 
set in binary. These bits when bit-reversed are ordered 
b^, b^, b 2 » b^. Now the algorithm may be applied 

directly to generate pseudo-bit-reversed order, for a 
given number of processors. Recall that the algorithm 
switches bits L*-l and L'"' at the first transfer stage, 
then increments both bits at successive stages. Thus, at 
the second stage, bits L^' and L*+l are switched, and so on. 
The transfer matrices used for the PN input case occur 
in reverse order from the BR input case. Thus, this 
algorithm can be used in reverse, to give the stage-by-stage 
row indices, k , and ultimately, the BR-ordering of the 
output sequence. Refer to Table 1 for a display of these 
relationships. 
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TABLE 1: SEQUENCE REORDERING (DIT MPFFT) 
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4 . DIF MPFFT Algorithms 

As in single-processor algorithms, the multi- 
processor DIF FFT exhibits much duality in relation to the 
multi-processor DIT FFT. The MPFFT matrix E can again 
be viewed as a DFT matrix with rows and columns reordered 
corresponding to the output and input sequences. The E 
matrix of the DIF with PN input/BR output is the transpose 
of the E matrix of the DIT with BR input/PN output. The 
transpose of a product of matrices is the product of the 
transposed matrices, in reverse order. The stage matrix 
factors of the DIF with PN input are thus the transpose 
of the factors, in reverse order, of the DIT with BR 
input. The DIF algorithm partitions the output sequence 
into even and odd values of k, the frequency index. The 
row indices now represent frequency separation at the 
individual computation stages. That is, the first matrix 
factor, E^ , has frequencies 0 and 4. The units of 
frequency depend on the analysis period, or record length. 

The column indices start out in the order of the input 
sequence for the first stage factor. They are then 
reordered by each transfer stage, in the same manner that 
the row indices are permuted in the DIT algorithm. Thus, 
the column indices, n, represent the points of the input 
sequence that reside in a given processor at any given stage. 
The transfer matrices are the same as for the DIT, and occur 
in the same order as the DIT with the same input ordering. 



76 



That is, the DIF with PN input has the same transfer 
pattern as the DIT with PN input. The pattern is the 
reverse of the DIT with BR input. Since the transfer 
matrices are symmetric, the transpose operation leaves 
them unchanged. Thus, the individual factors of the DIF 
with PN input/BR output are each the transpose, in reverse 
order, of the factors of the DIT with BR input/PN output. 

Since the ordering is reversed, the magnitudes of the 
transfer matrices* row labels are given by the frequency 
separation at the previous stage. Recall that the 
magnitude of the column labels on the transfer matrices for 
the DIT are given by the time separation for the next 
stage. (See Figure 2.15 and compare with Figure 2.4.) Similar 
duality exists between the DIF with BR input/PN output and 
the DIT using PN input/BR output. The E matrices are the 
transpose of one another, and the stage factors of the DIF 
are the transpose of the factors of the DIT, taken in 
reverse order. (See Figure 2.16 and compare with Figure 2.14.) 

This section has presented the factorization of 
FFT matrices, to obtain their multi-processor representations. 
Following sections will investigate the implementation of 
other fast transforms on the Butterfly Network. 
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Figure 2.15 8-Point/4-processor DIF FFT (PN Input) 
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Figure 2.16 8-Point /4-Processor DIF FFT (BR Input) 
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III. OTHER FAST TRANSFORMS ON THE BUTTERFLY NETWORK 



This section uses the tools developed previously to 
investigate other fast transform algorithms for potential 
implementation on the Butterfly Network (BN). Section II 
used matrix representations to show that the radix-2 FFT 
algorithm is supported by the BN. That is, the radix-2 
FFT can be partitioned so that multiple processors can 
carry out the computation in parallel. The BN supports 
the required inter-processor transfers of partially- 
processed data, producing transformed output in a new, yet 
easily decipherable, order.. This section examines several 
other algorithms to determine their adaptability to 
parallel processing on the BN. The analysis will focus 
on the required inter-processor transfer patterns, and 
the output ordering produced from the Butterfly Network. 

A. FAST HARTLEY TRANSFORM (FHT) 

Bracewell has articulated a real, fast transform 
algorithm that serves all the purposes of the complex 
Fourier transform [Ref. 11]. It is called the Fast 
Hartley Transform (FHT) in honor of Hartley, whose 1942 
formulation of a real integral transform is the basis for 
the FHT. The FHT algorithm is faster than the FFT, and 
if desired, may even be used, with a little additional cost, 
to compute the DFT. 
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As shown previously with the DFT , the discrete Hartley 
transform may be viewed as a square matrix operating on an 
N-dimensional vector. The DHT matrix can then be factored 
to obtain the FHT algorithm, where each factor represents 
one stage of the algorithm. The step from the Hartley 
transform to the Fourier transform can also be expressed 
as a matrix multiplication and so the factorization of 
the Hartley matrix leads directly to a new factorization 
of the Fourier matrix [Ref. 11: p. 64]. 

1 . Single-Processor FHT (SPHFT) Algorithm 

The Discrete Harley Transform (DHT) is defined 
[Ref. 11: p. 27] as the sum of the cosine and sine transforms 
N-1 

X(k) 2 x(n) cas(2’^nk/N) , k = 0 , 1 , . . . , N- 1 

^ n = 0 

(3.1) 

where cas 6 = cos 0 + sin0 . The inverse transform is: 

N-1 ' 

x(n) = S X(k) cas (2’^nk/N), n = 0 , 1 , . . . , N- 1 (3.2) 

k = 0 

Note that only real arithmetic is used, and except for the 
factor N ^ in (3.1), the forward and inverse transforms 
are identical. Thus, for real input data, the DHT is also 
real valued, and the original sequence can be obtained by 
using the same transformation formula a second time. 

Equation (3.1) may be represented as a vector-matrix 
equation : 



81 



X(k) = H x(n) 



(3.3) 



The DHT matrix H can be factored to give the FHT , similar 
to the way FFTs are obtained from the factorization of a 
DFT matrix. The input sequence is again assumed to have 
length equal to a power of 2. Thus, the DHT matrix H is 
factored as: 

H = Lq 

where G = log^N. The factors L , s=l , 2 , . . . , log „N are 
z z 

called stage matrices, and represent the computations at 
each stage of the algorithm. The input sequence x(n) is 
in bit-reversed order, and yields normal-ordered output. 
Each of the factors in (3.4) are sparse matrices, 
contributing to the speed of the FHT algorithm. The FHT 
employs a "divide and conquer" strategy similar to the 
FFT. An N-point DHT is obtained as a linear combination 
of the outputs of two N/2-point DHTs, each in turn 
obtained as combinations of the outputs of two N/4-point 
DHTs, etc. The decomposition proceeds until 2-point 
sequences result. The DHT of a 2-point sequence is 
obtained simply as the sum and difference of the 2 data 
elements. To illustrate, applying the definition (3.1) 
for N=2 yields: 
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X(0) = t[x(0)(cos 0+sin 0)+x(l)(cos 0+sin 0) ] 



X(l) = j[x(0)(cos 0+sin 0 ) +x ( 1 ) ( cos -n- +sin )] (3.5) 

X(0) = Hx(0) + x(l)] 

X(l) = i[x(0)-x(l)] (3.6) 

In matrix form: 



X(0) 


_ 1 


~1 


1 




X(0) 


X(1) 

L. ^ 


— 2 

1 


1 

L. 


-1 




x(l) 



Reference 11 defines a general decomposition formula that 
produces the DHT by successive halving of the input 
sequence. Given the input sequence x(n), n=0 , 1 , . . . , N-1 , 
let the N-element subsequences [x(0) 0 x(2) 0 ...’^ and 
|x(l) 0 x(3) 0 • • •} have DHTs Xj^(k) and X.,(k) respectively. 
Then , 

X(k) = X^(k) + X2(k)cos(27rk/N)+X2(N-k)sin(2irk/N) 

(3.8) 

This structure corresponds to the factorization of the 
DHT matrix, H [Ref. 11: p. 85]. 

The second stages compute one or more 4-point 
DHTs, using two 2-point DHTs as inputs. The cosine and 
sine multipliers will be either 1, 0, or -1, since 
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(2-7rnk/N) = (27rnk/4) = (nk'n’/2), and only integer multiples 
of 7 t/ 2 are thus possible. Applying (3.1) for N = 4 and 
putting the results in matrix form gives: 



X(0) 




1 1 1 1 




x(0) 


X(l) 




1 1-1-1 




x(l) 


X(2) 




1-1 1-1 




x(2) 


X(3) 




1-1-1 1 

_ 




x(3) 



(3.9) 



The columns are rearranged to correspond to bit-reversed 
input, yielding: 



l(oT 




1 1 1 r 




"x( o7 


X(l) 




1-1 1-1 




x(2) 


X(2) 




1 1-1-1 




x(l) 


_X(3J 




j. -1 -1 1 




_x(32 



(3.10) 



Equation (3.10) can be factored, since half the terms can 
be computed as 2-point DHTs , as mentioned above. This 
factorization yields: 



x(oT 




X(l) 




X(2) 




_X(3) 





1 1 
1 1 
1 -1 

1 -1 





x(0) 




x(2) 




x(l) 




x(3) 



= L, x(n) 



(3.11) 
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It is easily verified by carrying out the matrix 
multiplication that (3.11) is equivalent to (3.10). 

(In this development, and in the future, the N ^ factor 
is neglected for clarity. It can be inserted at the 
end, if necessary for normalization.) 

The third and subsequent stage factors, representing 
8- (or more) element FHT stage matrices are more complicated. 
The inputs to the 8-point FHT are the outputs of two 
4-point FHTs . This recursive property for generating 
higher-order FHTs from the combination of lower-order 
FHTs never changes. What does change is the number of 
non-zero elements per row in the stage matrices. For the 
third and subsequent stages, there are three non-zero 
elements in each row. This means that three elements 
enter, with appropriate coefficients, into each output 
element. The third stage factor is: 



Cl Si 



= 



(3.12) 



K. 
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where C = cos(2Trn/8), S = sin(2'n’n/8) and K = 
n n n 

[ cos ( 27rn /8) + sin ( 2irn/8) ] . Note that the cosine elements 

and the I's are on diagonals parallel to the main diagonal, 

but the sine elements run the other way. As a result, 

the independent variable of the operand element that 

multiplies the sine coefficient decreases as n 

increases. This property is referred to as retrograde 

indexing [Ref. 11; p. 73]. The arguments of the sine and 

cosine terms are integer multiples of 4 , taking on a 

numerical value of either, 0, ±1, or ± 1/V 2T Putting 

numerical values in the matrix and carrying out the 

matrix multiplication gives the result shown in Figure 3.1. 

Again it is easy to verify that the product in Figure 3.1 

gives the same result as the DHT definition of (3.1). 

Conversion of the DHT to the DFT is accomplished simply 

by associating the even and odd parts of the DHT with the 

real and imaginary parts of the DFT. For a discussion of 

the mechanics of this operation, see reference 11, pp. 75-77 

and reference 12, p. 150. However, if the power spectrum 

is desired, it can be computed directly from the DHT by 

2 2 

evaluating [X(k)J +[X(N-1)] . Finally, note that since 
the FHT is a real transform compared to FFT , which 
is complex, the real and imaginary parts of any 
complex input data can be processed in parallel, 
respectively, with the FHT [Ref. 12: p. 147]. 
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Figure 3.1 8-Point/l-Processor DIT FHT (BR Input) 
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2 . Multi-Processor FHT (MPFHT) Algorithm 



The Fast Hartley Transform can be partitioned among 
several processors for parallel implementation on the 
Butterfly Network. The number of points per processor, 

M=N/P, must be greater than or equal to 4, since the 
third and successive computation stages (N>^ 8 ) combine 
3 elements at a time. Thus, there are at least two 
(log 2 M) pre-transfer stages for any N-point FHT. The 
recursive nature of the FHT yields inter-processor transfer 
requirements similar to the MPFFT. That is, after 
L'''-=log 2 M pre-transfer stages, there are transfer stages 
interspersed between the subsequent post-transfer computation 
stages. As always, the transfer stages permute the row 
ordering, and thus the ordering of the output sequence. 
Processors again retain half their data, and transfer half 
at any given stage. When more than one transfer stage is 
required (4 or more active processors), the patterns of 
the output ordering are apparently not so convenient as in 
the MPFFT. More desirable patterns than those to be 
presented may well exist. The difficulty lies with the 
fact that the amount of data transferred is always a power 
of 2 (one-half each processor's memory), while 3 points 
are combined in the third and successive computation stages. 

The easiest case to consider is an 8 -point FHT 
implemented using 2 processors. There are 3 (log 28 ) 
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computation stages; 2 pre-transfer computation stages 
(L* = log2(N/P)=2) , 1 transfer stage (log 2 P=l) and 

1 post-transfer computation stage. For this case: 

X(k) = L„ T, L„ x(n) (3.14) 

The product is computed in Figure 3.2 and can be compared 

with Figure 3.1. The transfer matrix is symmetric, and 

was chosen in keeping with the in ver tabili ty property of 

the DHT. Note that this transfer pattern causes bit- 

reversed output, when the input is also in bit-reversed 

order. Again, exactly one-half of each processor's memory 

is exchanged. This time, however processors exchange 

alternating memory locations, instead of the upper- or 

lower-half as in the MPFFT. Processor P retains its 

o 

even-labelled memory locations, and transfers its odd- 
labelled memory locations in exchange for processor P^'s 
even-numbered data. The order of computation after 
transfer is somewhat subjective, and was chosen to minimize 
the amount of buffering required within each processor. 

In this case, the requirement of only 2 additional memory 
locations per processor for buffering appears to be 
optimum . 

Extending the discussion to N=16 illustrates the 
features of the generalized multi-processor FHT. For a 
16-point data sequence transformed using the same 2 
processors, the vector-matrix equation extends to: 
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(3.16) 



X(k) = L 4 L 3 L 2 x(n) 

The recursive property of the FHT keeps the first 3 stage 
matrices L, , and of (3.13) the same, with their 

patterns repeating along the main diagonal. Thus, only 
the transfer stage matrix T. and the final 16-point 
computation stage need be discussed. The transfer 
pattern is chosen again so that is symmetric. The 
symmetry property, along with the property of only one 
non-zero element (whose value is unity) per row and 
column, makes the transfer matrices presented herein 
orthogonal. Processor .Pq once again transfers its odd- 
labelled memory locations (this time, 4 elements) to P^, 
in exchange for even-numbered element. The order 

of computation obviously determines output ordering. 

This will be discussed further in the next section. 
Matrices T. and L, are shown, as is their product, in 

^ j_ 

Figure 3.3. The fourth stage compuation matrix L, of 
the single-processor product • T^ is equivalent to the 

single-processor L. matrix with reordered rows. Since, 
for any given row, the column entries are thesarae 
both implementations perform the same computation, 
although the ordering of their outputs is different. 

Consider now the 16-point FHT implemented on a 
4-processor Butterfly Network. This illustrates the 
^maximum transfers” case for the FHT since more than 3 
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points per processor are required. This is not really a 
problem, since the multi-processor's advantage lies in 
the computations that can be done in parallel, whether 
prior to or after the required inter-processor transfers. 
Reference 11 also contains a program called FASTPERMUTE 
that speeds up the required reordering of the input 
sequence, from normal to bit-reversed order. This thesis 
assumes that a bit-reversed sequence is made available 
as the input to the multi-processor algorithms presented. 

The vector-matrix equation for the 16-point/ 
4-processor MPFHT is: 



X(k) = L, T„ l; L,T. L. L. x(n) 



(3.17) 



where part of the fourth computation stage is performed 
prior to the second transfer, for reasons to be explained 
below. The pre-transfer stage matrices, L. and L^, are 
identical to the single-processor versions, and duplicate 
the patterns of the 2- and 4-point FHTs as submatrices 
along the main diagonal. The first transfer stage T^ 
occurs at the same place as, and duplicates the pattern 
of, the 8-point/2-processor MPFHT. This should be 
expected, since both have the same value of L* = log 2 M=log 2 ( N /P ) , 
and it is this value that drives inter-processor transfer 
requirements. The first post-transfer computation stage 
is also identical to the 8-point/2-processor case, again 
with the element pattern repeated along the main diagonal. 
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These patterns were chosen in keeping with the recursive 
property of the single-processor FHT algorithm. That is, 
an 8-point FHT can be made up from the outputs of two 
4-point FHTs, each in turn derived from the outputs of 
two 2-point FHTs. This property is responsible for the 
lower-order computation matrices repeating as 
submatrices along the diagonal in the early stages of 
higher-order FHTs. 

The fact that the FHT combines 3 elements into 
one output element at the third and successive computation 
stages causes problems when more than 2 processors are 
used. Recall that the number of active processors, P, 
sets the number of inter-processor transfer stages (log 2 P). 
Thus, for 4 or more active processors, there are 2 or more 
transfer stages. Since the transfers occur after all the 
pre-transfer computations (for N=16 or a higher power 
of 2) the transfers occur between stages involving 3 
points in the computations. Therefore, an inherent 
incompatibility exists between the transfers, involving 
a number of points equal to a power of 2 (one-half of 
each processor's local memory), and calculations involving 
3 elements. Figure 3.4 illustrates that all elements that 
are other than 1 (multiples of sin(’^/8) and cos(-»r/8)) fall 
into the right-half of matrix L, (columns 8-15). The 

O 

non-unity coefficients multiply elements x (8) through 
X (15), which are outputs from stage L„. Note in L„ 
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of Figure 3.5 that all these elements occur in processors 
^ 3 * Thus, one way around the problem is to 
partially pre-compute the fourth stage prior to transfer. 
After the inter-processor transfer, stage L, finishes the 
16-point MPFHT computation, using only 2 elements in the 
calculation of each output point. 

Note that the partial pre-calculation (stage , 
in Figure 3.5 and Equation 3.17) leaves processors Pq and 
Pj^ idle at that stage. This inefficiency pays dividends 
in the final stage 4 computation (L^), since this stage 
now only involves sums and differences of 2 elements. No 
synchronization problems should be generated, since the 
next stage is a transfer stage, involving handshaking. 
Partial precomputaton is only necessary when 2 or more 
transfers are required, and arises in part due to the 
decisions made involving patterns of the first transfer 
and order of computation of the first post-transf er 
computation stage. The products of the third and fourth 
stages of the 16-point/4-processor MPFHT are shown in 
Figure 3.6. It can be verified by comparing Figure 3.6a 
with Figure 3.4 that the multi-processor implementation 
performs identical computations as the single-processor 
algorithm. The output is reordered in the multi-processor 
version, as indicated by the row reordering caused by 
transfer J 2 . The columns of the product T 2 , 
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Figure 3.6a .16 Poi n t /4 -Processor WPFUT 

(Stage 4 ) 
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are reordered as well. This reflects the fact that the 
input to the partial pre-computation stage (L^) comes from 
the output of stage . This sequence was permuted by 
the first transfer in the 4-processor implementation, T. . 
This is verified by comparing Figure 3.6b with Figure 3.1, 
which shows identical elements in any given row, with the 
rows of Figure 3.6b reordered due to transfer . 

Figure 3.6b can also be compared with Figure 3.2. Note 
that each factor in Figure 3.6b is a duplicate of the 
factors in Figure 3.2, as is the final product. Thus, 
the MPFHT retains the recursive property of the FHT , in 
that the initial stages of higher-order FHTs are duplicates 
of the outputs of lower-order FHTs, as desired. 

3 . Transfer Patterns and Output Ordering 

The single-processor FHT (SPFHT) accepts bit- 
reversed input, and generates a normal-ordered output 
sequence. The multi-processor FHT (MPFHT) implemented 
on the Butterfly Network also accepts input in bit-reversed 
order. The transformed output sequence, however, is 
permuted by the inter-processor transfers required. The 
permuted output is not the pseudo-normal (PN) ordering 
characteristic of multi-processor FFTs , for reasons that 
will be explained shortly. Normal-ordered output may 
again be obtained by additional network transfers, or by 
cyclically reading the processors' local memories. 
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Recall the earlier statement that the "maximum 
transfers" case of the MPFFT won't work with the MPFHT. 
That is, 4 or more points must be contained in each 
processor. The situation involving the maximum number 
of transfer stages in the MPFFT occurs when only 2 points 
are contained in each processor. Thus, the first transfer 
of the "maximum transfers" MPFFT case (L*=log 2 M=l) is 
effectively bypassed in the MPFHT "maximum transfers" 
case (L*=2). With this in mind, the algorithm of 
Section II. D. 2 for generating transfer matrices can be 
modified to generate the MPFHT transfer patterns. The 
modification to the "bit-switching" algorithm of 
Section II. D. 2 is necessary because of which memory 
locations are involved in the MPFHT transfers. Recall 
that the MPFFT transfers involved the upper- or lower-half 
of a processor's local memory. The MPFHT algorithm 
transfers the even- or odd-numbered local memory 
locatons, instead of adjacent upper- or lower-half 
memory locations. This all relates to the properties 
of counting in binary. That is, every even decimal 
number has a binary representation ending in 0, while 
every odd decimal number's last bit is 1 when expressed 
in binary. The modified algorithm for generating MPFHT 
transfer matrices is described below. 
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As in the MPFFT case, the MPFHT transfer matrices 
are viewed as permutations of NxN identity matrices. 

This procedure determines the column indices where I's 
appear, by switching appropriate bits in the binary 
representation of the row indices. The bits being 
switched differ from those of the MPFFT transfer matrix 
algorithm. Again, this is due to the fact that the half 
of each processor's memory transferring is now based on 
even- and odd-numbered local memory locations. Not 
surprisingly then, one of the bits to be switched is 
always the last, or 0 th bit, since this bit determines 
even or odd. (Bits are again labelled bj^ bj^_ ^ . . . b 2b ^ b^ , 
where L=( log2N ) - 1 . ) The first transfer stage swaps bit 
bg with bit , where L* = log 2M=log 2 ( N/P) . Successive 
transfer stage switch bQ with then with bj^.^_j^2» 

etc. For example, in the 16 -point/ 4 -processor 
implementation of Figure 3 . 5 , log2N = 4 and L'^= 2 . Thus, the 
binary row indices have bits labelled b2b2bjbQ. The 
first stage transfer matrix determines column 
position of elements by switching bQ with b2 in the row 
indices. The second transfer matrixT2 switches bQ and b^. 
Source and destination processors are identical to the 
MPFFT case, and can be calculated using the cube^^_j^ 
functions to find a destination, given a source. 
Additionally, at the ith transfer stage, the (i-l)st 
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bit of the processor's binary address determines whether 
a processor transfers its even or odd memory locations. 

If the (i-l)st bit is zero, the processor transfers its 
odd-numbered memory locations, and if the bit is one, 
the processor transfers even-numbered memory locations. 

As in the MPFFT with "upper-half" and "lower-half" 
processors, those processor's transferring their odd- 
numbered memory locations receive the even-numbered 
locations of another processor. This convenient 
property again makes all transfer matrices symmetric 
and orthogonal. See Figure 3.7 for a graphical display 
of this example. 

Discussion of the reordering of the output 
sequence can be simplified by distinguishing between bit 
postions and bit labels. Let the normal-ordered output 
sequence have binary representation with bit labels 
^3^2^1^0’ occupying bit locations (positions) b 2 b 2 b^bQ 
(N=16). Now the algorithm j ust . discussed generates 
column indices from row indices, when applied to bit 
locations. That is, each transfer stage starts anew 
assuming bit locations ^3^2^1^0 indices, then 

switches the two appropriate bits for that transfer stage 
to give column positions where I's occur. Since the 
transfer stage permutes the output ordering cumulatively, 
the bit labels are not restarted at successive transfer 
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Source Proesssore 




Source Proceesors 




Figure 3.7 



Transfer Matrices of 16-point/ 
4-Processor MPFHT 
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stages. For example, the 16 -point/ 4 -processor case begins 

with bit labels ^2b2bj^bQ in bit locations b2b2bj^bQ. The 

first transfer T, switches b and b« to give bit labels 

-'^1 o 2 

(output sequence) and bit locations (column indices) 

bob^b-ibo. The second transfer T^ again starts with row 
J (J i z 

ordering in bit locations b2b2bj^bQ, while the bit labels 
are left partially-reordered (b2bQbj^b2). Then, switching 
b^ and b^ gives column indices ordered bQb2bj^b2» while 
the output sequence is bQb2bj^b2. When the bits of the 
respective binary representations are permuted according 
to this rule, the correct column positions and output 
sequence order result. This procedure was used when 
labelling the figures in this section. The column 
indices correspond to input ordering at a given stage, and 
row indices give output ordering from that stage. 
Therefore, the row labels of the final computation stage 
give output ordering in all cases. 

To summarize, the Fast Hartley Transform is 
realizable on the multi-processor Buttrerfly Network. 
Several pertinent features of the implementation have been 
presented, but details of the features and properties of 
this new algorithm have of necessity been only partially 
covered. The interested reader is referred to reference 11 
for a more detailed discussion of this algorithm that has 
great potential for signal processing applicatons. 
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B. FAST WALSH-HADAMARD TRANSFORM (WHT) 

While the DFT uses sinusoids of harmonic frequencies 
as its basis functions, other transforms can be defined 
using other bases. The Walsh functions, named for 
J.L. Walsh who introduced them in 1923, are the basis 
functions for the Walsh-Hadamar d transform (WHT) 

[Ref. 2: p. 301]. Walsh functions can be generated 
recursively, are orthogonal, and form a closed set. 

Since they are square waves, Walsh functions take on 
only two values, namely +1 or -1 [Ref. 13: p. 225]. 

This property enables the WHT to be calculated using real 
number operations, involving only additions and 
subtractions. This section presents two of the several 
possible ordering schemes for calculating the WHT. 

After introduction of the single-processor implementation, 
the WHT will be examined for compatibility with the 
multi-processor Butterfly Network. Again, the emphasis 
will be on inter-processor transfer requirements and 
output ordering. 

1 . Single-Processor Walsh Ordered Transform (WHT)w 
Walsh functions can be rearranged to form 
different ordering schemes such as Walsh or sequency 
order, Hadamard or natural order (shown in Figure 3.8), 
Paley or dyadic order, and cal-sal order [Ref. 2: p. 303]. 
This section examines the transform derived from the 
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Sequency 



wal^ (k , t) 
k 




walw(k,t) represents the kth Walsh function 
in sequency order 



Figure 3.8a Walsh Functions in Walsh or Sequency Order 

- [Ref. 2: p. 304] 
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wal^(k,t) represents the kth Walsh function 
in natural order 



Figure 3.8b Walsh Functions in Hadamard or Natural Order 

[Ref. 2: p.305] 
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Walsh functions are described in terms of sequency, 
which combines the number of sign changes and frequency. 
Sequency is defined as the number of sign changes per unit 
interval : 



sequency = 



/ 

. i(Z.C.) 

i(z.c.+i) 



even Z . C . 
odd Z . C . 



(3.18) 



where Z.C. is the average number of zero crossings per 
unit interval [Ref. 2: p. 305]. 

Sampling of the Walsh functions at equally spaced 
sample points generates the Walsh-Hadamard matrices in 
corresponding ordering. Periodic sampling of the 
Walsh functions in sequency order at 8 equi-spaced 
sample points gives: 
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3 3 

where [Hw(3)] is thus the 2 x2 Walsh-Hadamar d marix in 
Walsh or sequency order. The rows represent Walsh 
functions whose sequencies are listed. Whereas W~ 
defines the DFT matrix for N=2*^,, [Hw(G)] defines the 

(WHT)w matrix [Ref. 2: p. 310]. The (WHT)w is thus 
defined as: 

X(k) = (l/N)[Hw(G)]x(n) (3.20) 

while the input data sequence can be recovered from the 
inverse transform: 

x(n) = [Hw(G)]X(k) (3.21) 

The same algorithm can be used, then, to generate the 

transform or its inverse, except for the (1/N) factor. 

One fast algorithm that results from the marix factorization 

of (3.19) is shown in Figure 3.9. This factorization 

2 

reduces computation requirments from N to Nlog 2 N additions 
and subtractions, where N is again the length of the input 
sequence. The algorithm represented in Figure 3.9 accepts 
bit-reversed input, and generates a transformed output 
sequence in normal order. Note also that this algorithm 
can be computed in-place, thus minimizing extra memory- 
required for buffering. The f actor ed-matri x equation 
corresponding to Figure 3.9 is shown in Figure 3.10. 

Again, each matrix factor in (3.22) represents one butterfly- 
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Figure 3.9 Signal Flowgraph 8-Point/l-Processor (WHT)w [Ref. 2: p.311] 









^0 4 2 6.1 5 3 7 i^O 4 2 6 1 5 3 7 



0 


1 1 ! ! 


0 


1 I I 

1 1 1 1 1 


1 


11 11 1 


1 


ii 111 




1 r 1 ■ 




1 — 1 — 1- — 


2 


1 1-1 1 1 


2 


1 1 1 11 


3 


i' -i' 1 

i; .^4 i 


3 


1 ^1 1 n 


4 


1 1 1 1-1 


4 


1 1 Ui 1 

1 1 1 


5 


1 1 1 

^ L 1 Ll 

1 In In 


5 


1 1 1 

ii 1 -ii 

r T n 


6 


1 It 1 1 

1 1 1 


6 


1 1 1 ri 


7 


1 1 1 1 1 


7 


1 11 1-1 




1 1 
1 -1 



iitP 4 
0 
1 
2 

3 

4 

5 

6 
7 



^3 

2 6 1 5 3 7 

I I 

I I 

I I 

-4- 






1 -3l 

1 1 1 



'll' 
I ^ ^ I 



I 1 



-1| 

■I* 






I 1 

!i ij 



4 2 6 1 5 3 7_ 




‘0 4 26 15 37 

— 1 : 1 — 


* 1 ' 
11 1 1 1 


0 


ll|lljll|ll 


.iL_.i'..iJ..i 


1 


l-l_l 1-1 1 1-1 j_l-l 


Ul 1 1 |-1 


2 


1-1 1-1 1 1 1-1 1-1 1 


1_1 -ll_ l[ -1 


3 


1 1 
1 1 1-1-1 j 1 1 1-1-1 


!-i h 1 ' 


4 


1 1 j-1-1 i-1-1 [ 1 1 


11 -1 -11 1 
1 1 r 


5 


1-1 1-1 1 j-i 1 1 1-1 
n 1 “T 


1 1 1-1 pi 


6 


1-1 1 1-1 j-i 1 )-i 1 


ij ij -ij -i] 


7 


_i 1 j 1 1 j-i-i 



(3.22) 



Figure 3.10 8-Point/l-Processor Fast WHTw (BR Input) 
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computation stage. The final product, W > is equivalent 
to [Hw(3)] in (3.19), with the columns reordered to 
correspond to the bit-reversed input. The rows of W 
are in the normal order in which the output appears. 

2 . Multi-Processor Walsh Ordered Transform (MPWHT)w 
The Walsh or sequency ordered transform (WHT)w 
can be partitioned among several processors for 
implementation on the Butterfly Network. An alternate 
(equivalent) signal flowgraph to enable partitioning is 
shown in Figure 3.11. It is easily verified that the 
flowgraph of Figure 3.11 represents identical computations 
to those of Figure 3.9. Note that multi-processor 
implementation of Figure 3.9 directly is not possible 
due to the ordering of points participating in the first 
butterfly stage. The flowgraph of Figure 3.11 allows 
points participating in first stage butterfly computations 
to be input to the same processors when implemented using 
more than one processor. The input data is normal- 
ordered, however, and the algorithm produces bit-reversed 
output. Thus the WHT matrix W is the transpose of the 
W matrix of the BR input/normal-ordered output case. 

.rw 

The stages perform the same computations, but in 
different order. Thus the matrix factorization represented 
by the signal flowgraph of Figure 3.11 represents an 
alternate factorization of the WHT matrix. This 
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igure 3.11 Alternate Signal Flowgraph 
8-Point/l-Processor (WHT)w 



alternate factorization is shown in Figure 3.12, and 
is the basis for the multi-processor implementation of 
the WHT described in the following paragraphs. 

The factorization of (3.23) shown in Figure 3.12 
allows the multi-processor Walsh ordered transform 
(MPWHT)w to parallel the development of the MPFFT 
presented in Section II. The length of the input sequence 
and the number of active processor are again assumed to 
be powers of 2. There are log 2 N computation matrix factors, 
each representing a butterfly computation stage. 

(N=input record length.) These computation matrices are 
again divided into L*=log 2 M pre-transfer stages and 
log 2 P post-transfer stages, with log 2 P inter-processor 
transfer stages. (P=number of active processors, 

M=N/P=number of points in each processor.) The 
"maximum transfers" case again occurs when L*=l and 
each processor contains only 2 data points. For the 
8-point WHT on 4 processors, the vector-matrix equation 
factors as: 

X(k) = W„ T. W_ Ti Wi x(n) (3.24) 

^ 

where the W matrices represent butterfly stages, the T 

/N./ 

matrices describe transfer stages, x(n) is the input 
sequence, and X(k) is the transformed output sequence. 

The factorization of (3.24) is shown in Figure 3.13 and 
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Figure 3.12 8-Point/l-Processor Fast WHTw (Normal Input) 
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re 3.13 8-Point /4-Processor Fast WHTw (Normal Input) 
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represents the mul ti -processor implementation of (3.23) 
from Figure 3.12. Note the effect of the multi-processor 
implementation on the ordering of the output sequence. 

The single-processor WHTw produces bit-reversed output 
when the input is normal-ordered (Figure 3.12), The 
transfers in the multi-processor algorithm reorder the 
output to pseudo-bit-reversed (PER) order. Note also 
that the transfer matrices are the same as the 8-point/ 
4-processor MPFFT with BR input (Figure 2.4). This 
convenient property means that the algorithm of 
Section II, D. 4, which transforms normal to pseudo-normal 
ordering, may be used to permute BR (single-processor 
output) into PBR ordering. A slight modification must be 
made to account for BR order as the starting point, 
rather than normal ordering. Whereas the bit positions 
are labelled 8-point normal-to-PN 

reordering algorithm, the bit positions must be 
relabelled correspond to BR-to-PBR reordering. 

This is the same idea as shown in Table 1, and results 
in the appropraite reordering when the "bit-switching" 
algorithm is applied. 

Just as with the MPFFT, the MPWHTw can also 
be implemented using 2 processors. In this case the 
equation is: 

X(k) = W„ Tt W. W. x(n) (3.26) 
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and, as in the. MPFFT, there are 2 pre-transfer butterfly 
stages, a transfer stage, and 1 post-transfer butterfly 
stage. Transfer matrix is again identical to of 
the 8-point/2-processor DIT MPFFT. The output is again 
in PBR order, though different from the 8-point/4-processor 
PBR order. Since there is only one transfer in the 
2-processor case, its output is less-reordered from the 
single-processor BR ordering. This is analogous to the 
PN ordering of the MPFFT, which was shown to also be a 
function of the number of transfers required. 

The MPWHTw may also be implemented using PBR input, 
to produce normal-ordered output. Using the signal 
flowgraph of Figure 3.11, now with PBR-ordered input, 
the 8-point/4-processor MPWHTw matrix factors as: 

- W 3 X 2 ' I 2 Xl’ i<") (3.27) 

where Xi ' ~X2 normal input algorithm and X2*“Xi 

of (3.24) and Figure 3.13. TheWHTw matrix V is the 
transpose of the product W from Figure 3.13. The 
individual stage factors are different, however. They 
represent the same computations at each stage, but in a 
different order, since the input ordering is changed. 

These ideas are illustrated in Figure 3.14. The 
situation is slightly more complicated when only 2 
processors, requiring only 1 transfer stage, are used 
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3.14 8-Point /4-Processor Fast WHTw (PER Input) 
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with a PBR-ordered input sequence. The vector-matrix 
equation in this case is; 

X(k) = W„ W„ T, W. x(n) (3.29) 

where the output is again produced in normal order. The 

transfer matrix T. is identical to T, of the 8-point/ 

~L ~L 

2-processor MPWHTw with normal-ordered input, but occurs 
after only one pre-transfer butterfly stage. The 
product of (3.29) is again the transpose of the W matrix 
of the normal input case, and the same calculations are 
performed at each stage, again in different order. Thus, 
for PBR input, there are log 2 P butterfly stages with 
log 2 P transfers between each, then log 2 M post-transfer 
butterfly stages. The Walsh ordered transform is easily 
implementable on the multi-processor Butterfly Network, 
and exhibits many of the same properties as the multi- 
processor FFT. It can be used with normal-ordered input, 
to yield pseudo-bit-reversed output, or vice versa. 

3 . Single-Processor Hadamard Ordered Transform (WHT)h 
Walsh functions can be rearranged to form 
Hadamard or natural ordering. The periodic sampling of 
these Hadamard-ordered Walsh functions yields Hadamard 
matrices, which can be generated recursively [Ref. 2: p. 314, 
13: p. 226] as follows: 
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[Hj^(O)] = [1] 

[Hj^(k+D] = 



[Hj^(k)] 



[Hj^(k)] 
-[Hj^(k) ] 



k = 0 , 1 , . . . , log2N 

(3.30) 



The Hadamard ordered transform is similar to (3.20), and 
is defined as: 



X(k) = (1/N)[H. (G)]x(n) (3.31a) 

^ n 

where G=log 2 N. The inverse (WHT)h is defined as: 

x(n) = [H.(G)]X(k) (3.31b) 

^ n ^ 

The Hadamard ordered transform is also referred to as the 
binary Fourier representation (BIFORE) transform. Fast 
algorithms can also be developed for the (WHT)h based on 
matrix factorizations. The Hadamard matrix for an 8-point 
transform can be formed by 3 successive applications of 
(3.30) : 
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This matrix can be factored into 3(log2N) matrix factors, 
each representing a computation stage of a fast WHTh 
algorithm. One particular factorization, using normal 
input and producing normal output, is shown in Figure 3.15. 

4 . Multi-processor Hadamard Ordered Transform (MPWHT)w 
Another method of factoring (3.32) is shown in 
Figure 3.16. Again, this factorization is not unique, 
but allows for efficient implementation of (3.32). The 
corresponding single-processor signal flowgraph is shown in 
Figure 3.17. This factorization accepts bit-reversed input, 
and generates bit-reversed output. This method is 
compatible with a multi-processor implementation, in 
that points participating in the first butterfly stage 
are adjacent in the input sequence. It is easy to verify 
that W from Figure 3.16 is equivalent to (3.32), with 
rows and columns reordered to reflect the changes in 
output and input ordering, respectively. 

The algorithm of (3.34) and Figure 3.17 can be 
partitioned for implementation on the multi-processor 
Butterfly Network. For an 8-point WHTh implemented using 
4 processors, the vector-matrix equation is: 

X(k) = W^ T„ W„ T. W, x(n) (3.35) 

where the input sequence, ^(n), is in BR ordering. The 
output is permuted into PBR order by the necessary 
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Figure 3.15 8-Point / 1 -Processor Fast WHTh (Normal Input) 
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Figure 3.16 8-Point / 1 -Processor Fast WHTh (BR Input) 



126 



o 






CSJ 

X 



CD 

X 



X 



ID 

X 



CO 

X 



N 

X 




x: 

H 



0 

w 

CO 

Q) 

0 
0 
u 

Dk 

1 



4 ^ 

C 

•H 

0 

Dh 

I 

X 

x: 

a 

u 

bO 

5 

0 



«3 

d 

bO 

• H 

X 



r-1 

X 

<D 

U 

d 

bO 



127 



inter-processor transfers. The matrix product of (3.35) 
is carried out in Figure 3.18. Note that the transfer 
matrices and are again identical to the 8-point/ 

^ 1 ^ L 

4-processor DIT MPFFT with BR input. The product is 
equivalent to the W matrix of the WHTh single-processor 
implementation with rows reordered, reflecting the 
reordering of the output. 

An 8-point WHTh implemented on 2 processors can 
be expressed as: 

X(k) = W„ T, W. W. x(n) (3.36) 

where T, is the same as T- of the Walsh-ordered MPWHT , 
again identical to T, of the 8-point/2-processor DIT 
MPFFT with BR input. In this case also, the output is in 
PBR order, as would be expected. Thus, the algorithms 
from Section II for transfer matrix generation and output 
reordering may be used in this case also. 

Another alternative factorization of the 
Hadamard transform is to use pseudo-normal input data. 

This option is desirable, since the output is produced 
in normal order, as the Walsh ordered transform produces 
normal output when the input sequence is in PBR order. 

The 8-point/4-processor MPWHTw using PN input is 
expressed as: 



X(k) 



il3 




W. 




W. x(n) 



(3.37) 
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3.18 8-Point /4-Processor Fast WHTh (BR Input) 
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and the product is carried out in Figure 3.19. Similar 
to the Walsh ordered transform when PBR input is used, 
the transfers of (3.37) occur in reverse order from 
the BR input case of (3.35). That is, T, ' of (3.37) is 
identical to T., of (3.35), as areT^' and T, . When fewer 
processors are used, the transfer matrices are also the 
same, and also occur in reverse order. This has the 
effect of reversing the relationship between pre- and 
post-transfer butterfly stages. Now there are log^P 
butterfly and transfer stages, followed by log 2 M 
post-transfer butterfly stages with no transfers between 
the post-transfer butterflies. 

In this section all normalizing factors of (1/N) 

r* 

for the forward transforms have been omitted to allow 
concentration on the transfer patterns and reordering 
of output sequences. They are easily included at the 
end, if necessary. 

The ideas presented in this section are easily 
extended to larger data lengths and more processors as 
follows. For input record length N, there are always 
log 2 N total butterfly stages. For P active 
processors, there are log 2 P butterfly stages with inter- 
processor transfer stages between each. These occur after 
L*=log 2 M pre-transfer butterfly stages (BR input), or 
prior to L* butterfly stages (PN input). The transfer 
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Figure 3.19 8-Point /4-Processor Fast WHTh (PN Input) 



131 



patterns and output reordering for the Walsh transform 
(Walsh or Hadamard order) parallel those of the multi- 
processor FFT. Thus, algorithms for FFT transfer 
matrix generation are applicable to the corresponding 
WHT transfer patterns. Reordering of output sequences 
from the single-processor WHT are likewise obtainable 
from the corresponding MPFT algorithm. 

Finally, the Walsh transform is very compatible 
with the multi-processor Butterfly Network. Both Walsh 
and Hadamard orderings have many of the same properties as 
the MPFFT when implemented on the BN. Thus, ideas from 
Section II are also applicable to the WHT. 
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IV. COMBINED ALGORITHM FOR TRANSFER MATRIX 
GENERATION AND MULTI-PROCESSOR OUTPUT 



ORDERING 

The algorithms presented in Section II for generating 
transfer matrices and output reordering can be combined into 
a single algorithm. To do this, a distinction must be 
drawn between column labels and output labels. Recall 
the algorithm of Section II. D. 2 for generating transfer 
matrices as permutations of identity matrices. In an identity 
matrix, I's occur only in those positions where the row and 
column indices are the same. For the chosen transfer 
patterns on the Butterfly Network, processors transfer 
one-half of their data at any given transfer stage. Thus, 
half the elements move off the main diagonal, and half 
remain. The algorithm for generating transfer matrices 
gives the column indices where I's appear, as a permutation 
of the binary representation of the (normal-ordered) row 
indices. For half the row indices, this involves switching 
a 1 and a 0, and the corresponding element moves off the 
main diagonal (representing transfer of a data point). For 
the other half of the row indices, the algorithm switches a 
1 and a 1 , or a 0 and a 0, and the corresponding elements' 
row and column indices remain equal. These elements remain 
on the main diagonal, and represent data points not 
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transferring at that stage. The transfer marices are 
independent, in the sense that each successive transfer 
matrix is generated by permuting the normal-ordered row 
indices to obtain the columns where I's appear at that 
stage . 

The effect of the transfers on the output ordering is 
cumulative, however. Recall that the multi-processor fast 
transform output can be viewed as a reordering of the 
corresponding single-processor output. For example, a 
single-processor DIF FFT with bit-reversed input produces 
normal-ordered output. Section II. C showed that the DIT 
multi-processor FFT output is produced in pseudo-normal 
order. Recall also that pseudo-normal order is a function 
of the number of active processors, P, hence the number of 
transfer stages. That is, the pseudo-normal ordering for 
a 16-point/8-processor DIT FFT is different from that of 
a 16-point/4-processor DIT FFT. In the first case, 

3 (log 28 ) transfers are necessary, while the 4-processor 
case only requires 2 transfer stages. Thus, the pseudo- 
normal ordering of the 8-processor case is permuted 
"farther" in some sense from the single-processor 
normal-ordered output. 

With the above background in mind, the algorithm for 
generating transfer matrices is extended to also give 
output reordering as follows. A distinction is drawn 
between column indices and output ordering indices. 
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These are both represented as binary numbers, and then 
permuted according to the algorithm of Section . II . D . 2 . 

Since the column indices are generated from the normal- 
ordered row indices of each transfer stage, normal-order 
is the starting point for generating each transfer matrix. 
The output ordering begins in the appropriate single- 
processor output ordering for the fast transform being 
computed, then is cumulatively permuted by successive 
transfers . 

Several examples to illustrate the combined algorithm 
are shown in Table 2. This table illustrates the transfer 
matrix/output reordering of the DIT multi-processor FFT. 
Recall that the single-processor DIT FFT with bit-reversed 
input produces normal-ordered output. Likewise, for 
normal-ordered input, the single-processor DIT FFT generates 
output in bit-reversed order. These orderings are the 
starting points for the output reorderings shown in 
Table 2. Note that the bit-switching algorithm is applied 
literally to the output indices* bit labels, regardless of 
what position the bit labels occupy at a given stage. Also 
note that due to the cumulative effect of transfers oh the 
output ordering, the first transfer of any example is the 
only time the column indices and the output indices are the 
same. It is also obvious from Table 2 that pseudo-normal 
and pseudo— bit-reversed ordering depend on the number of 
transfers necessary. The first transfer stage of the 
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TABLE 2: COMBINED ALGORITHM FOR GENERATING TRANSFER MATRICES 
AND OUTPUT ORDERING (DIT MPFFT) 
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16-point/8-processor example is not required when using only 
4 or 2 processors, and is effectively bypassed in the latter 
two cases. It can also be verified, by comparing Table 2 
with Table 1, that the output ordering produced is the same 
when using the combined algorithm. 

Care must be taken in applying this algorithm, depending 
on the type of fast transform being computed. The algorithm 
from Section II. D. 2 and generalized here, was developed 
for DIT FFT transfer patterns. It can be used, with slight 
modifications to the sequence of bit-switching, for all 
other fast transforms presented herein. The modifications 
are discussed in the sections presenting the various other 
fast transforms. 
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V, CONCLUSIONS 



Matrix factorization is a useful method of representing 
distributed fast transforms. While not suggested as a 
method of actually computing the transforms , this 
representation provides insight into which points combine 
during computation stages, and into the transfer patterns of 
the Butterfly Network. Additional insight is gained in 
terms of the output reordering caused by the inter-processor 
data transfers . 

Matrix partitioning and factorization have been widely 
used to describe single-processor fast transform 
algorithms. This method was shown to also be useful in 
describing distributed fast transforms. Additional matrix 
factors are required to represent the necessary inter- 
processor transfer stages. These ^transfer matrices^' may 
be generated by permutations of identity matrices for 
selected transfer patterns on the Butterfly Network. The 
transfer structure was shown to determine the ordering of 
the transformed output sequence. The output ordering can 
be described in terms of a permutaton of the single-processor 
output produced by a given fast transform algorithm. 

The Butterfly Network also supports the Fast Hartley 
Transform, and the Walsh-Hadamar d Transform. Transfer 



138 



patterns similar to those required for the multi-processor 
FFT can also be used for these other fast transforms. The 
output orderings produced are unique, but easily decipherable. 

Many equivalent signal flowgraphs can be constructed to 
represent any given fast transform algorithm. This 
flexibility also permits many equivalent matrix factorizations 
to be developed. Constraints include the need to minimize 
additional memory requirements for buffering, or computing 
"in-place" as much as possible, and the desire to produce 
an orderly and decipherable output sequence. Additionally, 
for distributed algorithms, an orderly pattern of inter- 
processor data transfers is also desirable. The Butterfly 
Network was shown to meet these constraints for the 
computation of distributed FFTs , and to conveniently 
support several other fast transform algorithms. 

Further work would be desirable on developing a general 
method for finding the matrix factors of the distributed form 
of a given fast transform algorithm. An optimal implementation 
of the FHT (possibly the one given in Section III. A. 2) on 
the Butterfly Network is also worthy of further research, 
as the FHT appears -to have a quite promising future. 
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